People who are visually impaired or blind encounter numerous challenges in their daily lives, such as difficulty perceiving objects and surroundings along their routes. They may also face limitations in activities such as swimming, playing computer games, walking, and many other recreational pursuits. Simply put, there remains a gap in this field when it comes to enhancing the quality of life for individuals with blindness and vision impairment.
The emergence of large language models (LLMs) has paved the way for multimodal LLMs, which are capable of text-to-image retrieval, image captioning, and visual question answering. These three tasks are crucial for making digital content accessible to individuals with visual impairments: by providing comprehensive image descriptions, they enable visually impaired individuals to better understand and navigate their surroundings.
Vision2Audio is a proof-of-concept assistive technology project that bridges the gap between sight and sound, allowing visually impaired and blind individuals to perceive their environment in a way that was previously impossible. Using the LLaVA (Large Language-and-Vision Assistant) model via llama.cpp lets us run quantized LLM models locally on a resource-constrained device such as an Nvidia Jetson single-board computer (SBC).
Nowadays, there is a growing demand for running ChatGPT-like large language models on low-powered edge devices for offline and low-latency applications such as this project. In this project, I will be using the NVIDIA Jetson AGX Orin Developer Kit. It is a highly capable edge device: you can even run inference of the largest Llama-2 model, which has 70 billion parameters. You can order it from Seeed Studio. However, you can use any other consumer-grade hardware with an Nvidia GPU if you wish.
This web application aims to empower visually impaired and blind individuals by letting them capture images and ask questions about them using Automatic Speech Recognition (ASR). It then uses Text-to-Speech (TTS) technology to read out the answers to their inquiries.
For simplicity we will assume everything is installed. Without further ado, let’s get started!
LLaVA (Large Language and Vision Assistant)
Let's start by explaining what LLaVA is.
LLaVA (Large Language and Vision Assistant) is an end-to-end trained model that combines an image encoder with a large language model, and it serves as an open-source counterpart to GPT-4V. It was developed by researchers from the University of Wisconsin–Madison and Microsoft Research and released in 2023. It leverages the pre-trained CLIP vision encoder to understand images and a pre-trained LLaMA-based language decoder (Vicuna) to generate text, and it is instruction-tuned on visual conversation data generated with GPT-4. In this project, I will be using an implementation of LLaVA-1.5 that can convert images to text.
The illustration below shows that LLaVA-1.5 achieves state-of-the-art results on a wide range of benchmarks.
This model performs well in image-to-text description and visual question answering applications.
Example use cases include:
- Zero Shot Object Detection
- Image Understanding
- Optical Character Recognition (OCR)
You can use the demo on Google Colab to check the functionality of LLaVA-1.5.
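If you would like to experiment with LLaVA-1.5 from Python before setting up llama.cpp, a minimal sketch using the Hugging Face transformers integration might look like the following. The llava-hf/llava-1.5-7b-hf checkpoint, the prompt template, and the generation settings are assumptions on my part, not something prescribed by this project:
# Minimal sketch for trying LLaVA-1.5 via Hugging Face transformers.
# Assumptions: the llava-hf/llava-1.5-7b-hf checkpoint and this prompt template;
# adjust to whatever checkpoint and question you actually use.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("test.png")
prompt = "USER: <image>\nWhat breed of dog is it?\nASSISTANT:"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device, torch.float16)
output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))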
With its ability to convert images to text, LLaVA fosters greater accessibility for visually impaired individuals. By providing detailed descriptions of images, this AI model enhances their digital experiences and can potentially improve their quality of life.
Inference of LLaVA with llama.cpp
As you may already know, LLaMA is a family of free and open-source large language models released by Meta, with up to tens of billions of parameters, and it supports both CPU and GPU execution. However, it may be challenging to run vanilla LLaMA models on consumer-grade hardware. This is where llama.cpp comes into play. Llama.cpp is an implementation of the LLaMA architecture in C/C++, created by Georgi Gerganov. It expects models in the GGUF format, published in August 2023, which is used to load the weights for the C++ inference code. Llama.cpp also comes with a server mode.
First, let's get llama.cpp and set it up. Clone the repository:
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
Then, compile it using CMake. My target device is the Nvidia Jetson AGX Orin, so I will use 87 as the CUDA architecture:
# TX1, Nano - 53
# TX2 - 62
# AGX Xavier, NX Xavier - 72
# AGX Orin, NX Orin - 87
mkdir build
cd build
cmake .. -DLLAMA_CUBLAS=on -DLLAMA_CUDA_F16=1 -DCMAKE_CUDA_ARCHITECTURES=87
cmake --build . --config Release --parallel $(nproc)
Download the quantized ggml-model-*.gguf and mmproj-model-f16.gguf models from here. Place them in the models folder of the cloned llama.cpp repository. We'll use the q4_k quantization, which balances speed and accuracy well.
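If you prefer to script the download, here is a small sketch using the huggingface_hub package. The repository id mys/ggml_llava-v1.5-13b is an assumption based on the quantized models credited at the end of this article; substitute the repository you actually downloaded from:
# Sketch: fetch the quantized LLaVA weights into llama.cpp's models folder.
# The repo_id below is an assumption; point it at the repository you use.
from huggingface_hub import hf_hub_download

for filename in ["ggml-model-q4_k.gguf", "mmproj-model-f16.gguf"]:
    hf_hub_download(
        repo_id="mys/ggml_llava-v1.5-13b",
        filename=filename,
        local_dir="models/llava1.5-13B",
    )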
Now that the llava and server executables have been built, all you have to do is call them. You have two main ways to run your llama.cpp models:
- Via CLI Inference — The model loads, runs the prompt, and unloads in one go. Good for a single run.
Run the llava binary using the command below:
#Maximum value of -ngl for each model
#7B: 35
#13B: 43
#34B: 51
./bin/llava -m models/llava1.5-13B/ggml-model-q4_k.gguf \
    --mmproj models/llava1.5-13B/mmproj-model-f16.gguf \
    --image images/test.png \
    -ngl 35 \
    -p 'What breed of dog is it?'
The -ngl 35 flag offloads computation to the GPU when running the model. As the output below shows, the layers are offloaded to the GPU; the 7B model can offload up to 35 layers.
llm_load_tensors: offloaded 35/35 layers to GPU
llm_load_tensors: VRAM used: 12603.02 MB
You should see output similar to the following on the terminal:
prompt: 'What breed of dog is it?'
The breed of dog in the image is a Collie, which is characterized by its long fur and a noticeable mane.
main: image encoded in 1550.92 ms by CLIP ( 2.69 ms per image patch)
llama_print_timings: load time = 27242.83 ms
llama_print_timings: sample time = 1.46 ms / 29 runs ( 0.05 ms per token, 19822.28 tokens per second)
llama_print_timings: prompt eval time = 1759.37 ms / 621 tokens ( 2.83 ms per token, 352.97 tokens per second)
llama_print_timings: eval time = 5582.70 ms / 29 runs ( 192.51 ms per token, 5.19 tokens per second)
llama_print_timings: total time = 34239.95 ms
That's approximately 5 tokens per second. The output of the 7B model is as follows:
prompt: 'What breed of dog is it?'
The dog is a black and white Collie.
main: image encoded in 1686.72 ms by CLIP ( 2.93 ms per image patch)
llama_print_timings: load time = 11039.10 ms
llama_print_timings: sample time = 0.85 ms / 11 runs ( 0.08 ms per token, 12895.66 tokens per second)
llama_print_timings: prompt eval time = 968.37 ms / 621 tokens ( 1.56 ms per token, 641.29 tokens per second)
llama_print_timings: eval time = 1133.82 ms / 11 runs ( 103.07 ms per token, 9.70 tokens per second)
llama_print_timings: total time = 12950.34 ms
- Via Server Inference — The model loads into RAM and starts a server. It stays loaded as long as the server is running.
To run the Llava server, execute the following command in your terminal:
./bin/server -m models/llava1.5-13B/ggml-model-q4_k.gguf \
    --mmproj models/llava1.5-13B/mmproj-model-f16.gguf \
    --port 8080 \
    -ngl 35 \
    -t 20
Once the server is running, you can interact with the LLM API by sending POST requests to http://localhost:8080/completion using a tool like curl or a script.
The code snippet below shows how a Python script can interact with the LLaVA server:
import json
import base64
from io import BytesIO

import requests
from PIL import Image


def image_to_base64(img_path):
    # Resize the image to 336x336 (the resolution expected by the CLIP encoder)
    # and encode it as a base64 string for the llama.cpp server.
    with Image.open(img_path) as image:
        image = image.resize((336, 336))
        buffered = BytesIO()
        image.save(buffered, format="PNG")
        img_str = base64.b64encode(buffered.getvalue())
        return img_str.decode('utf-8')


def main(image_path, user_query):
    # Attach the image with id 12 so it can be referenced as [img-12] in the prompt.
    image_data = [{"data": image_to_base64(image_path), "id": 12}]
    data = {
        "prompt": f"USER:[img-12]{user_query}\nASSISTANT:",
        "n_predict": 128,
        "image_data": image_data,
        "stream": True
    }
    headers = {"Content-Type": "application/json"}
    url = "http://localhost:8080/completion"
    response = requests.post(url, headers=headers, json=data, stream=True)

    def generate_response():
        # The server streams server-sent events; each chunk is prefixed with "data: ".
        for chunk in response.iter_content(chunk_size=4096):
            if chunk:
                try:
                    chunk_json = json.loads(chunk.decode().split("data: ")[1])
                    content = chunk_json.get("content", "")
                    if content:
                        yield content
                except (IndexError, json.JSONDecodeError):
                    continue

    content = ""
    for chunk in generate_response():
        content += chunk
    print("Received content:", content)


if __name__ == '__main__':
    image_path = 'Path to image'
    user_query = "Describe the image"
    main(image_path, user_query)
When the first request is sent to the LLaVA server, it takes some time to load the model into RAM; subsequent prompts do not require loading it again.
Overall, this client code talks to the server and uses the LLaVA multimodal language model to generate text based on the given prompt and image. This setup opens up a world of possibilities for integrating multimodal LLMs into various applications and services.
Llava can be used in a variety of ways. Examples are shown below.
Example 1 - Swimming activities
Swimming can be a safe, relaxing, and healthy activity for all people, including those who are visually impaired or blind.
The response is as follows:
This can help prevent injuries and eliminate the need for a sighted buddy for a blind user.
With a few adaptations using LLaVA's image description approach, swimming can be enjoyed safely and effectively by everyone.
Example 2 - Walking
This approach can help visually impaired and blind people navigate and walk safely and independently, without the need for a sighted guide or a white cane.
The system can help to prevent falls and other accidents.
Visually impaired and blind people can use this system to increase their confidence and mobility.
This approach can provide a comprehensive solution for enhancing the mobility and safety of visually impaired individuals. Overall, this approach has the potential to revolutionize the way that visually impaired and blind people live their lives.
Example 3 - Home activities
Visually impaired and blind people may have difficulty identifying different types of food. Image-to-text technology can be used to identify food items in images.
By providing them with the information they need to eat food more easily and safely, image-to-text technology can help these individuals to live more independent and fulfilling lives.
Vision2Audio - Hardware architecture
Since mobile phones are an essential part of our daily lives, it is crucial that our technology be accessible via mobile phones. All processing is conducted on the Nvidia Jetson AGX Orin Developer Kit, which can be located at home and connected to a local network with internet access.
To give a quick overview of the scenario I want to create, check out the picture below.
The diagram shows a local network with a router, a smartphone or computer, and the Nvidia Jetson AGX Orin Developer Kit. The router has two interfaces: a LAN interface and a WAN interface. The LAN interface is connected to the Nvidia Jetson AGX Orin Developer Kit, while the WAN interface is connected to the internet.
The tunnel can be implemented using a variety of technologies, such as VPN, SSH, Cloudflare Tunnel, or WireGuard. The specific technology used will depend on the specific requirements of the network.
There are various methods to expose localhost to the internet. Here, we will be using Cloudflare Tunnel, as it is reliable and trustworthy. Follow these instructions to install cloudflared; the latest release is available from the Cloudflare website.
You can download the binary using the following command:
wget https://github.com/cloudflare/cloudflared/releases/latest/download/cloudflared-linux-arm64.deb
Once you have downloaded the binary, you can install it using the following command:
sudo dpkg -i cloudflared-linux-arm64.deb
Once you have installed Cloudflared, you can create a Cloudflare Tunnel by running the following command:
cloudflared tunnel --url http://127.0.0.1:PORT_NUMBER
This will connect your localhost server through the Cloudflare Tunnel.
Another way to implement a tunnel is to use an SSH tunnel, which uses the SSH protocol to create a secure tunnel between two hosts.
Vision2Audio - Software architecture
The software architecture of our Vision2Audio technology is designed to ensure ease of use and accessibility for all users. It involves the following components:
- Image and Audio Capture: A robust image and audio capture mechanism, such as a smartphone camera accessed via a web browser, is employed to capture visual and auditory information. It uses JavaScript's MediaRecorder API for streaming audio and video. The captured image is passed to the LLaVA server to generate an image description, while the captured audio is passed to the NVIDIA Riva Automatic Speech Recognition system.
- Automatic Speech Recognition system: The MediaRecorder API captures audio input from the user's microphone. The audio data is then sent to the NVIDIA Riva Automatic Speech Recognition system to convert it into text (see the Riva client sketch after this list).
- LLaVA server: The LLaVA server is responsible for processing images and generating descriptions for them. It uses the LLaVA multimodal language model to extract meaning from the images and generate descriptive captions.
- Backend app: The backend app is a Flask-based application that combines the ASR system and the LLaVA server to provide a complete speech-to-text and image understanding service. The app accepts audio and image data from the frontend app, processes them using the ASR system and the LLaVA server, and returns the results to the frontend app using NVIDIA Riva Text-to-Speech.
- Text-to-Speech: The processed information from the LLaVA server is converted into spoken words using the text-to-speech engine of the Nvidia Riva server. The results of the image processing are displayed on screen and also spoken aloud, making them accessible to visually impaired and blind people.
- Mobile Application Interface: The Vision2Audio technology also includes a user-friendly interface for visually impaired and blind people that lets them control it using two buttons. It presents the transcription results and shows the generated captions for the images.
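To make the Riva components above more concrete, below is a minimal client sketch using the nvidia-riva-client Python package. The server address (localhost:50051), the audio file names, and the voice name are my assumptions and should be adapted to your setup:
# Sketch: offline ASR and batch TTS against a local Riva server (assumed at localhost:50051).
import wave
import riva.client

auth = riva.client.Auth(uri="localhost:50051")   # assumed Riva server address

# Speech-to-text: transcribe a recorded prompt in offline (batch) mode.
asr = riva.client.ASRService(auth)
asr_config = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    max_alternatives=1,
)
riva.client.add_audio_file_specs_to_config(asr_config, "prompt.wav")
with open("prompt.wav", "rb") as f:
    asr_response = asr.offline_recognize(f.read(), asr_config)
question = asr_response.results[0].alternatives[0].transcript
print("Transcript:", question)

# Text-to-speech: read an answer back to the user.
tts = riva.client.SpeechSynthesisService(auth)
tts_response = tts.synthesize(
    "The dog in the image is a Collie.",
    voice_name="English-US.Female-1",            # assumed voice name
    language_code="en-US",
    sample_rate_hz=44100,
)
with wave.open("answer.wav", "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)                          # 16-bit PCM samples
    out.setframerate(44100)
    out.writeframes(tts_response.audio)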
The frontend app is designed to be accessible to visually impaired and blind people. It can be used jointly with screen readers and other assistive technologies to provide an equivalent user experience.
The frontend application communicates with the backend application through POST requests to a REST API. It sends requests to the backend API to initiate audio recording, receive transcription results, and submit images for captioning.
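As a rough illustration of how the backend glues these pieces together, here is a stripped-down sketch of such a Flask endpoint. The /process route, the form-field names, the voice name, and the assumption that the uploaded audio is already 16 kHz PCM WAV are mine, not the app's actual API:
# Illustrative sketch of the backend flow: accept audio + image, run Riva ASR,
# query the llama.cpp LLaVA server, and return the answer plus synthesized speech.
# Route, field names, and addresses are assumptions, not the real app's API.
import base64
import requests
import riva.client
from flask import Flask, request, jsonify

app = Flask(__name__)
LLAVA_URL = "http://localhost:8080/completion"      # llama.cpp server started earlier

auth = riva.client.Auth(uri="localhost:50051")      # assumed Riva server address
asr = riva.client.ASRService(auth)
tts = riva.client.SpeechSynthesisService(auth)

ASR_CONFIG = riva.client.RecognitionConfig(
    encoding=riva.client.AudioEncoding.LINEAR_PCM,
    language_code="en-US",
    sample_rate_hertz=16000,                        # assumed upload format
    max_alternatives=1,
)

def ask_llava(image_b64, question):
    # Non-streaming request to the llama.cpp /completion endpoint.
    payload = {
        "prompt": f"USER:[img-12]{question}\nASSISTANT:",
        "n_predict": 128,
        "image_data": [{"data": image_b64, "id": 12}],
    }
    return requests.post(LLAVA_URL, json=payload, timeout=120).json()["content"]

@app.route("/process", methods=["POST"])
def process():
    audio_bytes = request.files["audio"].read()
    image_b64 = base64.b64encode(request.files["image"].read()).decode("utf-8")

    # Speech-to-text, image understanding, then text-to-speech.
    question = asr.offline_recognize(audio_bytes, ASR_CONFIG).results[0].alternatives[0].transcript
    answer = ask_llava(image_b64, question)
    speech = tts.synthesize(answer, voice_name="English-US.Female-1",
                            language_code="en-US").audio

    # speech is raw PCM here; the real app would wrap it in a WAV container.
    return jsonify({
        "transcript": question,
        "answer": answer,
        "audio": base64.b64encode(speech).decode("utf-8"),
    })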
Overall, the architecture of the Vision2Audio technology is designed to provide a comfortable and accessible experience for all users, regardless of their visual abilities. By leveraging the power of mobile phones and advanced multimodal LLMs, this project aims to make this technology an essential tool for visually impaired and blind individuals to navigate their environment and stay connected with the world around them.
Vision2Audio - How it works
First, install and run the Nvidia Riva server. You can follow this quick start guide for installation instructions. You can also watch the video explanation below by JetsonHacks.
Start Riva server by running the command:
bash riva_start.sh
You will see the following output:
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Waiting for Riva server to load all models...retrying in 10 seconds
Riva server is ready...
Use this container terminal to run applications:
Once the Riva server is up and running, open another terminal and execute the following command:
./bin/server -m models/llava1.5-13B/ggml-model-q4_k.gguf \
    --mmproj models/llava1.5-13B/mmproj-model-f16.gguf \
    --port 8080 \
    -ngl 35
Keep the server running in the background. Then run:
python3 -m flask run --host=0.0.0.0 --debug
Optional: you can generate self-signed server and client certificates with OpenSSL:
openssl req -newkey rsa:4096 -x509 -sha256 -days 365 -nodes -out cert.pem -keyout key.pem
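If you go this route, the Flask development server can be pointed at the generated certificate pair. Below is a minimal illustration with a throwaway app object; in practice you would reuse the backend app:
# Minimal illustration: serving a Flask app over HTTPS with the self-signed pair.
# Development use only; a production deployment should use a proper WSGI server.
from flask import Flask

app = Flask(__name__)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, ssl_context=("cert.pem", "key.pem"))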
If everything is correct, we will see the following output:
WARNING: This is a development server. Do not use it in a production deployment. Use a production WSGI server instead.
* Running on all addresses (0.0.0.0)
* Running on http://127.0.0.1:5000
* Running on http://192.168.0.104:5000
Open another terminal and run cloudflared tunnel using the following command:
cloudflared tunnel --url http://127.0.0.1:5000
You should see output similar to the following on the terminal:
2023-11-25T10:03:43Z INF Thank you for trying Cloudflare Tunnel. Doing so, without a Cloudflare account, is a quick way to experiment and try it out. However, be aware that these account-less Tunnels have no uptime guarantee. If you intend to use Tunnels in production you should use a pre-created named tunnel by following: https://developers.cloudflare.com/cloudflare-one/connections/connect-apps
2023-11-25T10:03:43Z INF Requesting new quick Tunnel on trycloudflare.com...
2023-11-25T10:03:47Z INF +--------------------------------------------------------------------------------------------+
2023-11-25T10:03:47Z INF | Your quick Tunnel has been created! Visit it at (it may take some time to be reachable): |
2023-11-25T10:03:47Z INF | https://lb-mills-gzip-calendars.trycloudflare.com
To use Vision2Audio, simply open a web browser and go to the link generated by cloudflared:
https://lb-mills-gzip-calendars.trycloudflare.com
Then point your phone's camera at the scene you want described. The app will use its voice recognition and text-to-speech capabilities to describe the image to you.
Click "Record a prompt" and wait 5 seconds for it to finish. I reduced the number of buttons so blind users can easily capture audio. To simplify the process, I enabled autoplay, but Apple currently blocks autoplay audio for security reasons, so iOS users must press the "Convert to speech" button.
Vision2Audio can be used together with a mobile phone's screen reader. It is best practice to provide fully descriptive text for buttons, which is what I did here, so blind users can easily navigate.
Vision2Audio is a web application that blind users can use to capture and describe images using voice. This can be helpful in a variety of situations, such as:
- When in a store, a blind user can use Vision2Audio to describe the shelves and understand what is available.
- When at home, a blind user can use Vision2Audio to describe the contents of the fridge and see what they have on hand. This can help them plan their meals.
- When walking around, a blind user can use Vision2Audio to understand where they are and describe the scenes around them. This can help them avoid obstacles and find their way around unfamiliar areas.
Then, I asked my friend to test Vision2Audio in London. Even though the Nvidia Jetson is located in Kazakhstan, I provided him with a tunnel connection, so he was able to test the Optical Character Recognition (OCR) capabilities of the Vision2Audio web app. The results can be seen below, and they are pretty impressive.
The following example illustrates real-time usage of Vision2Audio.
Vision2Audio - A demo video on a laptop
Blind users can point their webcams towards objects or scenes, and the app will promptly generate comprehensive audio descriptions. This feature is particularly useful for tasks such as reading documents, identifying objects in photos, and exploring online content.
Blind users can also read street signboards through audio descriptions using Vision2Audio.
Then, discover how the app operates on a smartphone.
Vision2Audio - A demo video on a smartphone
For iOS users, clicking the button is required to initiate audio playback in the web browser due to security considerations; autoplay is blocked by default. While workarounds may exist, they will not be discussed here.
This real-time narration provides users with valuable information about their environment, including the location of objects, people, and text.
Both demonstrations showcase the app's ability to convert visual information into audio, providing an accessible experience for those with visual impairments. I explored how to set up and run a multimodal large language model on NVIDIA Jetson devices, enabling visually impaired and blind people to have conversational AI capabilities locally.
Limitations
- Hallucinated responses: As mentioned by Meta here, the language model may hallucinate information or make up facts that are not accurate or supported by evidence. It is not guaranteed that the multimodal LLM will generate 100% accurate image descriptions.
- No real-time inferencing: llama.cpp may not be the most optimal way to run LLM models locally. Nvidia GPUs and Nvidia Jetson devices offer native support for the TensorRT framework, which is designed to maximize their performance, so using TensorRT is often the preferred approach for unlocking the full potential of these powerful devices.
- User interface: It would be preferable to develop standalone iOS and Android applications so that blind users can easily access the app without opening a browser. The buttons should also be eliminated, and the UI should stay simple for people with low vision.
- Network latency: This can be considered a non-deterministic part of the project. Internet connections can vary due to various circumstances. Also, tunnels do not have a 100% uptime guarantee.
- On-device machine learning: It would be preferable to run all models on-device (e.g., in a mobile app or web browser), so that a blind user does not depend on an internet connection and their private data is not exposed to cloud providers and tunnels. However, current mobile phones' CPUs and GPUs cannot handle running multimodal large language models on-device, though this is expected to change soon. It is just a matter of time.
- Authors of LLaVA: Large Language and Vision Assistant (Haotian Liu*, Chunyuan Li*, Qingyang Wu, Yong Jae Lee).
- This project would not be possible without the llama.cpp project by Georgi Gerganov.
- LLaVA-v1.5-13B quantized models by M. Yusuf Sarıgöz.
- Bakllava quantized models by M. Yusuf Sarıgöz.
- Design of web app was taken from LLaVaVision project - A simple Be My Eyes web app with a llama.cpp/llava backend.
- Photos from London were taken by Madiyar Aitbayev.
I hope you found this project useful, and thanks for reading. Together, let's enhance accessibility and assistive technology for the visually impaired, creating a more inclusive world where they can navigate and interact with greater ease and independence. If you have any questions or feedback, leave a comment below.
All the code referenced in this story is available in my GitHub repo.