Generative AI is a category of artificial intelligence that creates new text, images, video, audio, or code. Handling the increasing complexity and diversity of data sources, such as images, videos, audio, text, speech, and sensor data, is one of the greatest challenges facing edge computing. Multimodal generative AI addresses this challenge by enabling several types of data to be processed simultaneously. Although most generative AI applications currently run in the cloud, many developers are now investigating the role of edge computing in supporting these technologies.
Generative AI Applications for the Edge

The combination of edge computing and generative AI enables real-time generative AI applications. As these applications grow in popularity, there is increasing pressure to move some or all of their AI workloads onto edge devices such as smartphones and PCs. In particular, there is growing demand for running large language models like ChatGPT on low-powered single-board computers for offline, low-latency applications. This project shows how to develop and deploy a generative AI–powered application on the NVIDIA Jetson AGX Orin Developer Kit. This edge device is strongly recommended because it is capable of running inference for the largest Llama-2 model, which contains 70 billion parameters.
In my previous project, Vision2Audio: Giving the Blind an Understanding Through AI, I developed an end-to-end web application for visually impaired and blind people using llama.cpp and LLaVA (Large Language and Vision Assistant). Llama.cpp is a library developed by Georgi Gerganov, designed to run LLMs efficiently on CPUs and GPUs. I mentioned several limitations there that I aim to fix in this project. This web application empowers visually impaired and blind individuals to capture images and ask questions about them in real time, leveraging MLC LLM's optimized runtime and the local_llm project by Dustin Franklin, which promise faster performance than llama.cpp.
MLC LLM, short for Machine Learning Compilation for Large Language Models, is a framework designed to enable high-performance, universal deployment of large language models. It uses compiler acceleration to deploy these models natively, with native APIs, across a range of platforms, including the web, iOS, and Android, and it supports multiple programming languages.
MLC-LLM enables us to run quantized LLM models locally on resource-constrained devices like the NVIDIA Jetson AGX Orin single-board computer.
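As a concrete illustration, the snippet below is a minimal sketch of running a quantized chat model through MLC LLM's Python API. It assumes the mlc_chat package and a pre-quantized Llama-2 chat model are already installed; the model name shown is an example, not necessarily the exact build used later in this project.

from mlc_chat import ChatModule

# Load a pre-quantized model from the local MLC model directory
# (the model name below is an example/assumption).
cm = ChatModule(model="Llama-2-7b-chat-hf-q4f16_1")

# Run a single prompt and print the generated text.
print(cm.generate(prompt="Explain edge computing in one sentence."))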
For simplicity, we will assume everything is installed. Without further ado, let's get started!
Hardware architecture

Mobile phones have become an essential part of our daily lives, so it is critical that our technology is accessible from a mobile phone remotely over public networks. Since all processing occurs on the NVIDIA Jetson AGX Orin Developer Kit, located at home and connected to a local network with internet access, finding a solution for remote web application access is crucial. This is where OpenVPN presents itself as a viable solution.
OpenVPN is an open-source Virtual Private Network (VPN) project that creates secure connections over the internet utilizing a custom security protocol based on SSL/TLS. A Raspberry Pi makes an excellent choice for setting up an OpenVPN server due to its ease of configuration. One popular feature is the ability to access servers or services on the remote network, enabling users to connect their iPhones or other devices via VPN to their home network securely.
Here's a brief overview of the scenario I would like to create:
The diagram shows a local network with a router, a smartphone or computer, a Raspberry Pi, and the NVIDIA Jetson AGX Orin Developer Kit. The router has two interfaces: a LAN interface and a WAN interface. The LAN interface is connected to the NVIDIA Jetson AGX Orin Developer Kit and the Raspberry Pi, while the WAN interface is connected to the internet. The Raspberry Pi acts as the OpenVPN server, and the smartphone or computer acts as an OpenVPN client.
Run the command below to install PiVPN on the Raspberry Pi 5.
curl -L https://install.pivpn.io | bash
This command downloads the installation script from PiVPN’s GitHub page and executes it. Then, a dialog box will appear, prompting you to answer a few questions regarding the setup of the OpenVPN server. In this case, we will select the default settings, as they are sufficient to get the server up and running.
The installation is now complete! Reboot your Raspberry Pi.
After reboot, run the command below to start the profile creation.
pivpn add
If you see messages like those below, your installation was successful.
::: Create a client ovpn profile, optional nopass
:::
::: Usage: pivpn <-a|add> [-n|--name <arg>] [-p|--password <arg>]|[nopass] [-d|--days <number>] [-b|--bitwarden] [-i|--iOS] [-o|--ovpn] [-h|--help]
:::
::: Commands:
::: [none] Interactive mode
::: nopass Create a client without a password
::: -n,--name Name for the Client (default: "raspberrypi")
::: -p,--password Password for the Client (no default)
::: -d,--days Expire the certificate after specified number of days (default: 1080)
::: -b,--bitwarden Create and save a client through Bitwarden
::: -i,--iOS Generate a certificate that leverages iOS keychain
::: -o,--ovpn Regenerate a .ovpn config file for an existing client
::: -h,--help Show this help dialog
Enter a Name for the Client: user
How many days should the certificate last? 1080
Enter the password for the client:
Enter the password again to verify:
* Notice:
Using Easy-RSA configuration from: /etc/openvpn/easy-rsa/pki/vars
* Notice:
Using SSL: openssl OpenSSL 3.0.11 19 Sep 2023 (Library: OpenSSL 3.0.11 19 Sep 2023)
-----
* Notice:
Keypair and certificate request completed. Your files are:
req: /etc/openvpn/easy-rsa/pki/reqs/user.req
key: /etc/openvpn/easy-rsa/pki/private/user.key
Using configuration from /etc/openvpn/easy-rsa/pki/14a25c96/temp.6c37cb45
Check that the request matches the signature
Signature ok
The Subject's Distinguished Name is as follows
commonName :ASN.1 12:'user'
Certificate is to be certified until Jan 10 22:23:31 2027 GMT (1080 days)
Write out database with 1 new entries
Database updated
* Notice:
Certificate created at: /etc/openvpn/easy-rsa/pki/issued/user.crt
Client's cert found: user.crt
Client's Private Key found: user.key
CA public Key found: ca.crt
tls Private Key found: ta.key
========================================================
Done! user.ovpn successfully created!
user.ovpn was copied to:
/home/raspi/ovpns
for easy transfer. Please use this profile only on one
device and create additional profiles for other devices.
========================================================
You can access the profile by navigating to the ovpns folder. Then, load the .ovpn file into OpenVPN Connect, OpenVPN's official client, which is available for Windows, macOS, Linux, iOS, and Android. Once the profile is loaded, the client can connect to your OpenVPN server.
Once connected, all your traffic will be encrypted and routed through your home internet connection, where the Raspberry Pi resides.
You will also need to set up port forwarding on your internet router; most consumer routers expose this through a web admin interface. Set both the external and internal port to 1194 (or your custom port), enter the Raspberry Pi's IP address in the IP address field, and select UDP as the protocol.
If your ISP uses CGNAT (Carrier-Grade NAT), you will not have a publicly reachable IP address on your home network. Unfortunately, this limits the ability to run a private VPN, since you effectively don't have a direct gateway to the public internet. In this case, you can either explore what it takes to get a publicly routable IP address from your ISP (business plans sometimes offer this) or consider using a tunneling service.
Such a service gives you a publicly accessible IP address or domain name that can be used to reach your localhost web services. Please note that using a service like this may incur additional costs and potentially expose your personal data.
Here, we will be using Cloudflare Tunnel, as it is reliable and trustworthy. Follow these instructions to install cloudflared. You can download the latest release of cloudflared from the Cloudflare website.
You can download the binary using the following command:
wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb
Once you have downloaded the binary, you can install it using the following command:
sudo dpkg -i cloudflared-linux-arm64.deb
Then you can create a Cloudflare Tunnel by running the following command:
cloudflared tunnel --url http://127.0.0.1:PORT_NUMBER
This will expose your local server through a Cloudflare Tunnel and print a public URL you can use to reach it.
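To verify the tunnel end to end before exposing the real application, you can point it at a throwaway local server. The sketch below assumes PORT_NUMBER is 5000; adjust it to whatever port you pass to cloudflared.

from http.server import BaseHTTPRequestHandler, HTTPServer

class PingHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Answer every GET with a small plain-text body so the public
        # tunnel URL is easy to verify from a phone or another network.
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.end_headers()
        self.wfile.write(b"Cloudflare Tunnel is working\n")

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 5000), PingHandler).serve_forever()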
Another way to implement a tunnel is to use an SSH tunnel, which uses the SSH protocol to create a secure, encrypted channel between two hosts.
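If you already control a publicly reachable host, a reverse SSH tunnel can also be started from Python. The sketch below is only an illustration: the remote host, user, and ports are placeholders, and it simply wraps the standard ssh -R remote forwarding option.

import subprocess

# Forward remote port 8080 on a publicly reachable host (placeholder
# address) to the local web app on port 5000. -N keeps the session
# open without running a remote command.
tunnel = subprocess.Popen([
    "ssh", "-N",
    "-R", "8080:127.0.0.1:5000",
    "user@public-host.example.com",
])

# The tunnel stays up until the process is stopped, e.g. tunnel.terminate().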
Software architecture

The software architecture of my Vision2Audio web application is designed to ensure ease of use and accessibility for all users.
The main components of the system are:
- FastAPI Backend App: The backend application is built with FastAPI and Python, and it uses the MLC-LLM implementation from Dustin Franklin's local_llm project to provide an image-understanding service. The frontend web client communicates with the FastAPI backend through REST API requests: it asks the backend to initiate audio recording, receives transcription results, and sends captured images to LLaVA for description (a minimal sketch of this flow follows the component list).
- Frontend Web Client: The frontend app is designed to be accessible to visually impaired and blind people. It uses the MediaRecorder API in JavaScript to capture audio and video. The captured image is passed to the FastAPI Backend App to generate an image description, while the captured audio is passed to the NVIDIA Riva automatic speech recognition service. Howler.js is used on the web client to auto-play the audio produced by NVIDIA Riva text-to-speech. The interface is user-friendly and designed for visually impaired and blind people, allowing users to control it with a single button.
- NVIDIA Riva Server: This component serves as the speech recognition and text-to-speech engine. It takes audio data and converts it into text, and vice versa.
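The sketch below illustrates the request flow described above from the backend's side. The /describe route, the field names, and the describe_image() helper are illustrative placeholders, not the exact implementation; in the real app the helper would call LLaVA through the MLC-LLM/local_llm runtime.

from fastapi import FastAPI, File, Form, UploadFile

app = FastAPI()

def describe_image(image_bytes: bytes, question: str) -> str:
    # Placeholder for the LLaVA call made through local_llm / MLC-LLM.
    return f"(model answer to: {question})"

@app.post("/describe")
async def describe(image: UploadFile = File(...), question: str = Form(...)):
    # 1. Read the frame captured by the web client.
    image_bytes = await image.read()
    # 2. Ask the vision-language model about it.
    answer = describe_image(image_bytes, question)
    # 3. Return plain text; the client plays it back via Riva TTS and Howler.js.
    return {"answer": answer}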
Finally, I created the frontend of the web application. First, create two folders, templates and static, in the same directory as app.py.
Here is the resulting file structure:
├── app.py
├── templates
│   └── index.html
└── static
    ├── howler.min.js
    ├── microphone-solid.svg
    ├── script.js
    └── style.css
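The snippet below is a sketch of how app.py can serve these folders with FastAPI (it assumes jinja2 is installed alongside FastAPI; the actual app.py also contains the speech and image endpoints).

from fastapi import FastAPI, Request
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from fastapi.templating import Jinja2Templates

app = FastAPI()

# Serve howler.min.js, script.js, style.css, and the icon from /static.
app.mount("/static", StaticFiles(directory="static"), name="static")
templates = Jinja2Templates(directory="templates")

@app.get("/", response_class=HTMLResponse)
async def index(request: Request):
    # Render the single-page frontend; it loads its assets from /static.
    return templates.TemplateResponse("index.html", {"request": request})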
Start the NVIDIA Riva server by running the command:
bash riva_start.sh
You will see the following output:
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
Use this container terminal to run applications:
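Once the server reports that it is ready, it can be sanity-checked from Python. The sketch below assumes the nvidia-riva-client package is installed, that Riva listens on the default localhost:50051 gRPC port, that a 16 kHz mono WAV file named prompt.wav exists, and that the English-US.Female-1 voice is available.

import wave
import riva.client

auth = riva.client.Auth(uri="localhost:50051")

# Speech-to-text: transcribe a short 16 kHz mono PCM recording.
asr = riva.client.ASRService(auth)
config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    sample_rate_hertz=16000,
    language_code="en-US",
    max_alternatives=1,
)
with open("prompt.wav", "rb") as f:
    response = asr.offline_recognize(f.read(), config)
print(response.results[0].alternatives[0].transcript)

# Text-to-speech: synthesize a short reply and save it as a WAV file.
tts = riva.client.SpeechSynthesisService(auth)
resp = tts.synthesize(
    "Hello from Riva",
    voice_name="English-US.Female-1",
    language_code="en-US",
    sample_rate_hz=44100,
)
with wave.open("reply.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 16-bit PCM
    out.setframerate(44100)
    out.writeframes(resp.audio)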
Once the Riva server is running, open another terminal and execute the following command:
git clone https://github.com/dusty-nv/jetson-containers.git
In this project, we will use the pre-built Docker container from the jetson-containers project. It is also possible to build the container from source, but it takes more time.
./run.sh -v ./webapp:/app/ $(./autotag local_llm)
The Docker container will run automatically and pull all the required files; this takes 5-20 minutes depending on your network.
Inside the container, run the following commands:
pip3 install pydub
pip3 install python-multipart
sudo apt-get update
sudo apt-get install ffmpeg
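The pydub and ffmpeg packages are needed because the browser's MediaRecorder typically uploads compressed audio (for example audio/webm), while Riva's English ASR models typically expect 16 kHz, 16-bit mono PCM. A sketch of that conversion is shown below; prompt.webm and prompt.wav are example file names.

from pydub import AudioSegment

# Decode whatever container/codec the browser sent (ffmpeg does the work).
clip = AudioSegment.from_file("prompt.webm")

# Resample to 16 kHz, mono, 16-bit samples for Riva ASR.
clip = clip.set_frame_rate(16000).set_channels(1).set_sample_width(2)
clip.export("prompt.wav", format="wav")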
Determine the size of the LLaVA model you wish to use, such as 7B or 13B (7 billion or 13 billion parameters, respectively).
Finally, run the following command to start the application:
python3 -m uvicorn app:app --port 5000 --host 0.0.0.0 --ssl-keyfile ./key.pem --ssl-certfile ./cert.pem
This command will start the application on port 5000 over HTTPS and make it accessible from outside the container. The key.pem and cert.pem files enable HTTPS, which browsers require before they will grant a web page access to the microphone and camera.
INFO: Started server process [2215]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on https://0.0.0.0:5000 (Press CTRL+C to quit)
It will load the model parameters into the local cache on the first run, which may take a few minutes. Since the settings will already be cached, subsequent refreshes and runs will be faster.
Then point your phone's camera at the image you want described. The app will use its automatic speech recognition and text-to-speech capabilities to describe the image to you.
Most browsers on iOS devices prevent audio files from playing automatically when a web page loads. This restriction exists to prevent unwanted audio from playing without the user's consent. Howler.js, a JavaScript library used in Vision2Audio, works around this restriction by tying audio playback to user-triggered events.
Click Record a prompt and wait 5 seconds for the recording to finish automatically. This behavior is particularly useful for blind users who may have difficulty navigating the page and finding a stop button. The number of buttons on the page has been kept to a minimum to make it easier for visually impaired users to find and use the audio recording feature. The app has been tested on the iPhone 14 Pro Max, iPhone SE, and Google Pixel.
The following examples illustrate real-time usage of Vision2Audio.
One of the key features of Vision2Audio is its real-time narration capabilities. This allows users to receive audio descriptions of their environment. This can be incredibly helpful for navigating unfamiliar areas, as it provides users with valuable information about their surroundings.
I was able to test the Optical Character Recognition (OCR) capabilities of Vision2Audio web app.
The app's ability to accurately transcribe text from images is truly impressive, and it opens up a world of possibilities for users who are blind or have low vision.
This real-time narration provides users with valuable information about their environment, including the location of objects, people, and text.
When you're walking around, Vision2Audio can be used to understand where you are and to describe the scene around you. This can help you avoid obstacles and find your way around unfamiliar areas, and it can help prevent falls and other accidents.
Vision2Audio isn't limited to outdoor navigation, however. It can also be incredibly useful for everyday tasks at home. For example, you can use it to describe the contents of your fridge, making meal planning easier. By bridging the gap between perception and understanding, this AI application empowers users to navigate their environment and connect with the world around them like never before.
Overall, the architecture of Vision2Audio is designed to provide a comfortable and accessible experience for all users, regardless of their visual abilities. By leveraging the power of mobile phones and an advanced multimodal VLM, this project aims to make the technology an essential tool for visually impaired and blind individuals to navigate their environment and stay connected with the world around them.
Limitations

- Hallucinating responses: The large language model can sometimes generate responses that include inaccurate or unsupported information, so VLMs are not guaranteed to produce 100% accurate descriptions of images.
- Network latency: This is a non-deterministic part of the project; internet connection quality can vary depending on circumstances.
References

- Authors of LLaVA: Large Language and Vision Assistant (Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee).
- It would not be possible without the local_llm project by Dustin Franklin from Nvidia.
- Thanks to MLC LLM, an open-source project, we can now run LLaVA on the NVIDIA Jetson AGX Orin Developer Kit.
- Thanks to Makat Tlebaliyev and Azamat Iskakov