In a small town away from the hustle of big cities, a group of young engineers set out on an unconventional mission. Inspired by empathy and driven by innovation, they embarked on a journey to craft an ingenious solution to aid the visually impaired in navigating the urban landscape independently. Thus, an idea was formed - An auto-navigation system designed to empower those with visual challenges.
The team, consisting of young and passionate undergrads from various disciplines, came together fuelled by a common purpose: to create a transformative tool that would transcend barriers and grant newfound freedom to individuals with visual impairments. Armed with empathy and a deep understanding of the users' needs, the team set to work. Countless brainstorming sessions, prototyping phases, and meticulous testing ensued. Guiding Gaze wasn't merely a technological endeavour; it was a labor of love and empathy.
We needed more than just a piece of hardware or software; we needed an ecosystem: something that combines cutting-edge AI-driven algorithms and user-friendly interfaces to create an intuitive system. It detects obstacles, identifies landmarks, and guides the user through haptic feedback and audio cues. The system wasn't just about providing directions; it was about enhancing spatial awareness and fostering confidence. The users should feel empowered, moving through the city streets with a newfound sense of independence.
The team's dedication bore fruit and the sleepless nights paid off as they witnessed heartwarming moments of triumph. Testing was a success, the goal was achieved, and Guiding Gaze was born into the world.
Report

Our Idea

We aim to empower visually impaired individuals to navigate footpaths, parks, and less frequented roads independently, minimizing the need for constant assistance. While our solution aims for complete autonomy, we advise the use of a walking stick as an added precaution to alert those nearby.
We propose a real-time navigation system for the visually impaired via a mobile application and wired communication between electronic components. Our solution deploys various models to detect and avoid obstacles and to navigate through any environment in real time, providing a smooth and elegant user experience.
We use the DroidCam app to connect our Android phone to our laptop over a USB data cable, making the phone serve as a webcam for the laptop. Utilizing the laptop's CPU, we perform the following to guide the user:
- Obstacle Detection – Detecting incoming objects, humans, animals, vehicles, etc.
- Depth Estimation – Calculating the distance of an object via the video received from the camera.
- Scene Recognition – Using Deep Learning models to identify the local environment in which the user is present. This can be a park, a footpath or their room.
- Barrier Detection – Detecting walls, barriers or gates to prevent collision.
- Facial Recognition – Using Deep Learning models to detect and recognise familiar faces. This assists the user in conversing and interacting with their close ones.
- Navigation – Using all the above features to navigate through the environment, creating a traversable route to the user’s destination and assisting them in reaching it.
Here's a concise breakdown of our project's key components:
- Video Input: Live video streaming from our phone to the laptop for real-time processing.
- Video Analysis: Frame-by-frame analysis via various Deep Learning models. Object identification, facial recognition, and scene analysis to gather path-related information.
- Depth Estimation: Utilising data from the analysis phase to determine the distance of objects from the user.
- Navigation: Using insights from the prior stages to compute a safe and feasible path for the user.
- Audio Guidance: Sending audio cues through the app to guide the user along the calculated path towards their destination.
Each stage contributes to a comprehensive system that processes live video, analyses the environment, estimates object distances, plans a safe route, and offers real-time audio guidance, ensuring a secure and efficient navigation experience for the user.
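To make this flow concrete, the sketch below shows the skeleton of the per-frame loop running on the laptop. The webcam index and the preview window are assumptions on our part; each numbered comment points to the module described in the sections that follow.

import cv2

# The DroidCam feed appears to the laptop as an ordinary webcam, so OpenCV can read it
# directly; the device index (1) is an assumption and may differ per machine.
cap = cv2.VideoCapture(1)
while cap.isOpened():
    ok, frame = cap.read()
    if not ok:
        break
    # Each frame is handed to the modules described in the following sections:
    #   1. Obstacle detection on `frame` (YOLOv7, Code Snippet 1)
    #   2. Depth estimation of detected obstacles (MIDAS, Code Snippet 2)
    #   3. Scene recognition (EfficientNetB2, Code Snippet 3)
    #   4. Barrier detection (OpenCV contouring, Code Snippet 4)
    #   5. Facial recognition (VGG-Face)
    # Their outputs are combined with the MapQuest route (Code Snippet 5)
    # to produce the next audio cue for the user.
    cv2.imshow('Guiding Gaze - live feed', frame)
    if cv2.waitKey(1) & 0xFF == ord('q'):
        break
cap.release()
cv2.destroyAllWindows()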
Obstacle Detection

YOLOv7 [1], short for "You Only Look Once, version 7," is a cutting-edge Deep Learning model that excels at real-time object detection with exceptional speed. Built on the principle of processing each image or frame just once, its compact design enables real-time object detection using minimal computational resources.
YOLOv7: Architecture and Training Synopsis
- Grid-Based Prediction: Rooted in a neural network framework, YOLOv7 dissects input images into a grid, predicting bounding boxes and object class probabilities within each grid cell.
- Enhanced Capabilities: YOLOv7 surpasses its predecessors by integrating distinctive features that elevate its prowess in real-time object detection and classification.
- Application in Navigation: A pivotal component of our navigation system, YOLOv7 underwent training on diverse datasets like COCO and custom datasets encompassing obstacles commonly encountered in road and home environments.
- Demonstrated Effectiveness: Visual demonstrations show YOLOv7 accurately classifying and localising obstacles across diverse scenarios.
By harnessing YOLOv7 at the core of our navigation system, meticulously trained and fine-tuned for real-time obstacle detection, we've established a resilient solution that enables swift and precise identification of obstacles, ensuring safety and reliability.
Code Snippet 1:
# Excerpt from YOLOv7's detect.py [1], adapted for our pipeline; the helper imports
# below come from the YOLOv7 repository itself.
from pathlib import Path
from utils.general import increment_path, set_logging
from utils.torch_utils import select_device

def detect(save_img=False):
    k = 0
    # Pull run options (source, weights, image size, etc.) from the parsed CLI arguments
    source, weights, view_img, save_txt, imgsz, trace = opt.source, opt.weights, opt.view_img, opt.save_txt, opt.img_size, not opt.no_trace
    save_img = not opt.nosave and not source.endswith('.txt')
    # Treat numeric sources and stream URLs (such as our DroidCam feed) as webcam input
    webcam = source.isnumeric() or source.endswith('.txt') or source.lower().startswith(
        ('rtsp://', 'rtmp://', 'http://', 'https://'))
    # Create an incrementing results directory (and a labels/ sub-folder if saving text)
    save_dir = Path(increment_path(Path(opt.project) / opt.name, exist_ok=opt.exist_ok))
    (save_dir / 'labels' if save_txt else save_dir).mkdir(parents=True, exist_ok=True)
    set_logging()
    frame_bounding_box_data = []  # custom list collecting bounding-box data for later stages
    device = select_device(opt.device)
    half = device.type != 'cpu'  # half precision is only supported on CUDA
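Code Snippet 1 shows only the setup portion of detect(). For orientation, the hedged sketch below runs a single frame through YOLOv7 using helpers from the same repository [1]; the weights file name, frame path, and confidence/IoU thresholds are assumptions, not the exact values used in our pipeline.

import cv2
import numpy as np
import torch
from models.experimental import attempt_load              # YOLOv7 repo helpers [1]
from utils.general import non_max_suppression, scale_coords
from utils.datasets import letterbox

device = torch.device('cpu')                               # assumption: CPU-only laptop
model = attempt_load('yolov7.pt', map_location=device)     # assumed weights file
names = model.names                                        # class labels (person, car, ...)

frame = cv2.imread('frame.jpg')                            # assumed sample frame
img = letterbox(frame, 640, stride=32)[0]                  # resize and pad to the model input size
img = img[:, :, ::-1].transpose(2, 0, 1)                   # BGR to RGB, HWC to CHW
img = np.ascontiguousarray(img)
img = torch.from_numpy(img).float().unsqueeze(0) / 255.0   # add batch dimension, scale to [0, 1]

with torch.no_grad():
    pred = model(img)[0]
pred = non_max_suppression(pred, conf_thres=0.25, iou_thres=0.45)[0]

if len(pred):
    # Rescale boxes from the model input size back to the original frame size
    pred[:, :4] = scale_coords(img.shape[2:], pred[:, :4], frame.shape).round()
    for *xyxy, conf, cls in pred:
        print(names[int(cls)], float(conf), [int(v) for v in xyxy])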
Depth Estimation

MIDAS [2] is a Deep Learning model designed to estimate depth or distance from a given image. It utilizes Convolutional Neural Networks (CNNs) to predict object distances by generating a depth map.
Our approach allows MIDAS to efficiently estimate depth from a single camera, leveraging the power of CNNs and specialized processing to provide accurate depth maps while reducing the computational cost.
MIDAS Architecture Overview:
- Encoder-Decoder Framework: MIDAS employs a CNN architecture comprising an encoder and a decoder. The encoder processes input images, extracting features while reducing spatial dimensions and creating an abstract data representation. The decoder then refines these extracted features, reconstructing a detailed depth map from the abstracted representation.
- Optimization Techniques: To enhance performance, MIDAS utilizes optimization methods like Adam Optimizers and mathematical functions such as the Cross-Entropy Loss Function.
- Depth Map Normalization: MIDAS generates a spline array to create a normalized, continuous representation of the depth map. Using the previous module's output, it calculates the obstacle's centre and utilizes this coordinate within the spline array to compute the depth.
Code Snippet 2:
from scipy.interpolate import RectBivariateSpline
import numpy as np

def depth_to_distance(depth_value, depth_scale):
    # Convert a (relative) depth value into an approximate distance
    return 1.0 / (depth_value * depth_scale)

# Build a continuous spline over the normalized depth map so it can be sampled
# at arbitrary coordinates
h, w = output_norm.shape
x_grid = np.arange(w)
y_grid = np.arange(h)
spline = RectBivariateSpline(y_grid, x_grid, output_norm)
depth_scale = 1

# Sample the spline; in practice the (y, x) arguments are the centre of the
# obstacle's bounding box from the detection stage
depth_mid_filt = spline(0, 0)  # y, x
depth_midas = depth_to_distance(depth_mid_filt, depth_scale)
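Code Snippet 2 assumes a normalized depth map (output_norm) is already available for the current frame. The sketch below is one way to produce it with the MIDAS model published on PyTorch Hub [2]; the lightweight MiDaS_small variant, the sample frame path, and the min-max normalization are our assumptions (chosen for CPU-friendly inference), not necessarily the exact configuration used in the project.

import cv2
import numpy as np
import torch

# Load a lightweight MIDAS variant and its matching preprocessing transform from PyTorch Hub [2]
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

frame = cv2.imread("frame.jpg")                       # assumed sample frame
img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)

with torch.no_grad():
    prediction = midas(transform(img))                # relative depth, one value per pixel
    prediction = torch.nn.functional.interpolate(
        prediction.unsqueeze(1),
        size=img.shape[:2],
        mode="bicubic",
        align_corners=False,
    ).squeeze()

depth = prediction.cpu().numpy()
# Min-max normalization gives the continuous map that Code Snippet 2 turns into a spline
output_norm = (depth - depth.min()) / (depth.max() - depth.min())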
Scene Recognition

Scene Recognition provides us with a more comprehensive and contextualised understanding of the local environment, and simplifies the diverse challenges faced by the visually impaired in different outdoor and indoor conditions. Scene Recognition is done by employing EfficientNetB2 [3], a Convolutional Neural Network (CNN)-based Deep Learning model.
The different scenes we considered include a balcony, bathroom, bedroom, boardwalk, garage, highway, kitchen, lake, market, site, street, and train.
EfficientNetB2 Architecture Overview:
- Compound Scaling: EfficientNetB2 leverages compound scaling, a method that uniformly scales the network's depth, width, and resolution with a compound coefficient. This approach ensures that each architectural dimension is optimized simultaneously, balancing model complexity and performance.
- Efficient Blocks (MBConv): It incorporates Mobile Inverted Bottleneck Convolutional (MBConv) blocks, which consist of depthwise separable convolutions with expansion and squeeze-and-excitation (SE) operations. These blocks efficiently capture complex patterns while minimizing computational overhead, crucial for resource-constrained environments.
- Efficiency and Performance Balance: EfficientNetB2 strikes a balance between model efficiency (reducing parameters and computations) and high performance across tasks such as image classification and object detection. This balance is crucial for achieving state-of-the-art results while remaining resource-efficient, making it suitable for a wide range of applications.
EfficientNetB2's design focuses on compound scaling, efficient block usage, and feature integration to ensure strong performance and resource efficiency in various computer vision tasks. It makes real-time processing possible without the use of GPUs.
Code Snippet 3:
import tensorflow as tf
from tensorflow.keras.layers import Dense, Activation, Dropout, Conv2D, MaxPooling2D, BatchNormalization, Flatten
from tensorflow.keras import regularizers
from tensorflow.keras.models import Model, load_model, Sequential

model_name = 'EfficientNetB2'
class_count = 12  # one class per scene: balcony, bathroom, ..., train

# ImageNet-pretrained EfficientNetB2 backbone with max pooling in place of the classifier head
base_model = tf.keras.applications.EfficientNetB2(include_top=False, weights="imagenet",
                                                  input_shape=(224, 224, 3), pooling='max')
x = base_model.output
# x = layers.GlobalAvgPool2D(name="pooling_layer")(x)
x = tf.keras.layers.BatchNormalization(axis=-1, momentum=0.99, epsilon=0.001)(x)
# Two regularized dense layers on top of the backbone features
x = Dense(256, kernel_regularizer=regularizers.l2(0.016), activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x = Dense(128, kernel_regularizer=regularizers.l2(0.016), activity_regularizer=regularizers.l1(0.006),
          bias_regularizer=regularizers.l1(0.006), activation='relu')(x)
x = Dropout(rate=0.45, seed=42)(x)
output = Dense(class_count, activation='softmax')(x)  # softmax over the scene classes
model = Model(inputs=base_model.input, outputs=output)
model.compile(tf.keras.optimizers.Adamax(learning_rate=0.001), loss='categorical_crossentropy', metrics=['accuracy'])
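Once the network above is trained, it can be queried frame by frame. The brief usage sketch below assumes the model object from Code Snippet 3 and an example frame path; the ordering of the class names is illustrative and must match the indices produced by the training data generator.

import numpy as np
import tensorflow as tf

# The 12 scene classes (ordering shown here is illustrative; it must match training)
classes = ['balcony', 'bathroom', 'bedroom', 'boardwalk', 'garage', 'highway',
           'kitchen', 'lake', 'market', 'site', 'street', 'train']

img = tf.keras.preprocessing.image.load_img('frame.jpg', target_size=(224, 224))  # assumed frame
x = tf.keras.preprocessing.image.img_to_array(img)
x = np.expand_dims(x, axis=0)   # EfficientNet applications rescale raw [0, 255] input internally

probs = model.predict(x)[0]     # `model` is the network built in Code Snippet 3
scene = classes[int(np.argmax(probs))]
print(f"Detected scene: {scene} (confidence {probs.max():.2f})")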
Barrier Detection

We've chosen to use the OpenCV library because YOLOv7 isn't effective for barrier detection in our context. Instead, we rely on colour-based contouring techniques.
Here's how our process works:
- Colour-Based Contouring: Using OpenCV, we convert the image to grayscale and apply Gaussian blur for noise reduction. Applying adaptive binary thresholding highlights areas with significant colour intensity changes.
- Contour Extraction: From the binary image, we extract contours. We filter out insignificant regions by applying a threshold. If the remaining contours exceed 97% of the image, we consider it a significant change.
- Alert Trigger: If the change surpasses the threshold, we trigger an alert. The system informs the user about the wall-like obstruction ahead through an audio cue, advising them to stop.
Code Snippet 4:
import cv2
import numpy as np
from PIL import Image
from playsound import playsound

img = Image.open('W_1.jpeg')
im0 = np.array(img)

# PIL loads images as RGB, so convert RGB (not BGR) to grayscale, then blur to reduce noise
gray = cv2.cvtColor(im0, cv2.COLOR_RGB2GRAY)
blurred = cv2.GaussianBlur(gray, (5, 5), 0)

# Adaptive binary thresholding highlights regions with significant intensity changes
adaptive_threshold = cv2.adaptiveThreshold(blurred, 255, cv2.ADAPTIVE_THRESH_MEAN_C, cv2.THRESH_BINARY, 11, 2)

# Extract contours and discard insignificant regions
contours, _ = cv2.findContours(adaptive_threshold, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
min_contour_area = 1000
contours = [cnt for cnt in contours if cv2.contourArea(cnt) > min_contour_area]

# If the remaining contours cover more than 97% of the frame, treat it as a wall-like obstruction
total_area = im0.shape[0] * im0.shape[1]
change_area = sum(cv2.contourArea(cnt) for cnt in contours)
percentage = (change_area / total_area) * 100
print(percentage)

threshold_percentage = 97
if percentage > threshold_percentage:
    playsound('w.mp3')  # audio cue advising the user to stop
    print(f"Detected changes occupy more than {threshold_percentage}% of the screen. Possible obstruction detected.")
Navigation

Getting to our main purpose, we need to implement various methods for a streamlined navigation process. We use MapQuest and Geocoder for route planning from the user's location to the specified destination.
- MapQuest: MapQuest, renowned as an early precursor to Google Maps, furnishes an API enabling us to compute precise directions from the starting point to the intended endpoint. These directions are conveyed in the JSON file format, offering comprehensive route details.
- Integration with Geocoder: We leverage the Geocoder library to extract crucial latitude and longitude coordinates corresponding to the user's current location. These coordinates serve as the originating point for MapQuest, facilitating the accurate calculation of navigation directions.
By amalgamating MapQuest's directional capabilities with Geocoder's location extraction functionality, we ensure an effective and seamless navigation experience, guiding users from their current position to their desired destination with precision and ease.
Code Snippet 5:
import requests
import geocoder
import speech_recognition as sr
import pyttsx3
from gtts import gTTS
from playsound import playsound

def dir(b):
    # Current location (latitude, longitude) estimated from the device's IP address
    g = geocoder.ip('me')
    a = g.latlng
    api_key = 'Your API key'
    url = 'https://www.mapquestapi.com/directions/v2/route'
    origin = a
    destination = b
    params = {
        'key': api_key,
        'from': origin,
        'to': destination,
    }
    # Ask the MapQuest Directions API for a route; the response is a JSON document
    response = requests.get(url, params=params)
    data = response.json()
    route = None
    if response.status_code == 200:
        route = data['route']
        print(route)
        print(f"Distance: {route['distance']} miles")
        print(f"Duration: {route['formattedTime']}")
    else:
        print(f"Error: {data['info']['messages'][0]}")
    return route

def text_to_speech(text, filename, language='en'):
    # Synthesize the narration with gTTS and play it back to the user
    tts = gTTS(text=text, lang=language, slow=False)
    tts.save(filename)
    playsound(filename)

# Listen for the spoken destination, then request directions to it
r = sr.Recognizer()
while True:
    try:
        with sr.Microphone() as source2:
            r.adjust_for_ambient_noise(source2, duration=0.2)
            audio2 = r.listen(source2)
            MyText = r.recognize_google(audio2)
            MyText = MyText.lower()
            print(MyText)
            if MyText:
                data = dir(MyText)
                break
    except sr.RequestError as e:
        print("Could not request results; {0}".format(e))
    except sr.UnknownValueError:
        print("unknown error occurred")

# Read the first manoeuvre of the route and extract its distance ("Go for X mi.")
text = data['legs'][0]['maneuvers'][0]['narrative']
prefix_index = text.find("Go for ")
end_index = text.find(" mi.", prefix_index)
if prefix_index != -1 and end_index != -1:
    start_index = prefix_index + len("Go for ")
    distance_str = text[start_index:end_index]
    extracted_number = float(distance_str)
    di = (extracted_number * 60) // 2  # rough walking time in minutes (appears to assume about 2 mph)
    updated_text = f"{text} This will take approximately {di} minutes."
    text_to_speech(updated_text, 'tr.mp3')
Facial Recognition

Facial recognition is a well-known topic that has gained immense popularity in the past decade. We integrate it into Guiding Gaze as a feature to help the user recognise different people. We use the facial recognition model VGG-Face [4] to detect and recognise individuals already known to the user. This assists the user in conversing and interacting with their close ones.
VGG-Face Architecture Overview:
- Based on the VGGNet architecture, specifically designed for face recognition tasks.
- Consists of 13 convolutional layers followed by 3 fully connected layers, a VGG-16-style configuration.
- Builds on the VGGNet design, offering strong feature representations for face-related tasks. Predominantly uses 3x3 convolutional kernels throughout the network.
- Typically pretrained on a large-scale face dataset to leverage transfer learning for face recognition tasks.
- The network learns hierarchical features, capturing facial patterns at different levels of abstraction.
- Initially trained for face identification but adaptable for various facial analysis tasks.
VGG-Face, an extension of the VGGNet architecture, is made specifically for face-related tasks, providing a robust framework for facial recognition and analysis due to its depth, convolutional layers, and learned feature representations.
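The report does not include a code snippet for this module, so the sketch below shows one possible way to query a pretrained VGG-Face model [4] using the open-source deepface library. The library choice, the faces/ directory of enrolled photos, and the frame path are our assumptions for illustration, not the project's actual implementation.

from deepface import DeepFace

# `db_path` is a folder of photos of people the user has previously enrolled;
# 'frame.jpg' stands in for the current camera frame.
results = DeepFace.find(img_path='frame.jpg',
                        db_path='faces/',
                        model_name='VGG-Face',
                        enforce_detection=False)   # do not raise when no face is visible

# Recent deepface versions return a list of DataFrames, one per detected face
if len(results) > 0 and not results[0].empty:
    match = results[0].iloc[0]['identity']         # path of the best-matching enrolled photo
    print(f"Recognised: {match}")
else:
    print("No familiar face in view")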
Guiding Gaze is currently a purely software solution. To make it a usable product, we need to integrate it with electronic components. We intend to use an NVIDIA Jetson module for the processing power and pair it with a wireless miniature AI camera that can be mounted on glasses or worn on the user's chest. We then add a 12 V battery pack to provide roughly six hours of battery life.
Integrating all this with an app will give us a complete, usable solution that can be tested in hospitals to help patients with recent vision loss get accustomed to their new lifestyle.
References
1. YOLOv7: https://github.com/WongKinYiu/yolov7
2. MIDAS: https://pytorch.org/hub/intelisl_midas_v2/
3. EfficientNetB2: https://www.tensorflow.org/api_docs/python/tf/keras/applications/efficientnet/EfficientNetB2
4. VGG-Face: https://www.robots.ox.ac.uk/~vgg/software/vgg_face/
Demo Video