Blind and visually impaired individuals often encounter a variety of socioeconomic challenges that can hinder their ability to live independently and participate fully in society. However, the advent of machine learning has opened up new possibilities for the development of assistive technologies. In this study, we combined image captioning and text-to-speech technologies to create a device that serves as an aid for people who are visually impaired or blind.
Furthermore, I would like to share my experience of optimizing a deep learning model with TensorRT to improve its inference time. For more information, please refer to the preprint available on TechRxiv, titled Image Captioning for the Visually Impaired and Blind: A Recipe for Low-Resource Languages.
For simplicity, we will assume everything is already installed.
As Single Board Computers (SBCs) become increasingly popular for running AI and deep learning projects, some have even been designed specifically for this purpose. We utilized a reComputer with an NVIDIA Jetson Xavier NX from Seeed Studio (@seeedstudio) as the brain of our system. The reComputer J20 comes with a Jetson Xavier NX module that delivers up to 21 TOPS, making it ideal for high-performance compute and AI in embedded and edge systems.
NVIDIA Jetson devices are compact, energy-efficient boards capable of executing machine learning algorithms in real time. Nevertheless, it can be difficult to deploy intricate deep learning models on these devices because of their limited memory. To overcome this, we used inference optimization tools such as TensorRT, which make it possible to run deep learning models on edge devices by reducing their memory footprint.
The Intel RealSense D455 depth camera is an upgrade to the D435 in terms of field of view and resolution. It has a wider field of view of 86° x 57° compared to the D435's 70° x 53°, and a higher resolution of 1280 x 720 compared to the D435's 640 x 480.
This makes it better suited for applications that require capturing more detail or a wider area, such as robotics, assistive devices, mapping, and virtual reality.
Image Captioning Model deployment pipeline
We used the popular Microsoft COCO 2014 (COCO) benchmark dataset to train the ExpansionNet v2 image captioning model. The dataset consisted of 123,287 images, with each image having five human-annotated captions, resulting in a total of over 600,000 image-text pairs. We split the dataset into training (113,287 images), validation (5,000 images), and test (5,000 images) sets, using the Karpathy splitting strategy for offline evaluation. To generate captions in Kazakh, we translated the original English captions using the freely available Google Translate service.
To train the model for Kazakh captions, we followed the model architecture defined in the original work on ExpansionNet v2. The pre-trained Swin Transformer was used as a backbone network to generate visual features from the input images. The model was trained on four V100 graphics processing units (GPUs) in an Nvidia DGX-2 server.
Finally, the image captioning model, ExpansionNet v2, was deployed on the Nvidia Jetson Xavier NX board. The camera was triggered by pressing the push button to capture an RGB image with a resolution of 640 × 480 pixels. The captured image was then resized to 384 × 384 and passed to the ExpansionNet v2 model to generate a caption. Next, the generated caption text was converted into audio using a text-to-speech model; in our study, we utilized the KazakhTTS model to convert Kazakh text to speech. Lastly, the generated audio was played through the user's headphones, making it possible for individuals who are blind or visually impaired to understand what is in front of them.
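To make this flow concrete, here is a minimal sketch of the capture-and-describe loop. The helper functions describe_image and synthesize_speech are hypothetical placeholders, not the actual project code; in the real device they correspond to the ExpansionNet v2 model and KazakhTTS, respectively, and the capture is triggered by the push button.
import cv2

# Placeholder stubs for illustration only: the real device calls the ExpansionNet v2
# captioning model and the KazakhTTS speech synthesis model here.
def describe_image(image):
    return "placeholder caption"

def synthesize_speech(text, wav_path):
    pass

camera = cv2.VideoCapture(0)                 # RGB stream of the camera connected over USB
ok, frame = camera.read()                    # in the real device this happens on a button press
if ok:
    frame = cv2.resize(frame, (384, 384))    # resize the 640x480 capture to 384x384
    caption = describe_image(frame)          # generate a caption for the image
    synthesize_speech(caption, "caption.wav")  # convert the caption text to audio
    # the resulting audio is then played through the user's headphones
camera.release()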
ONNX overview
ONNX is an open format for machine learning and deep learning models. It allows you to convert deep learning and machine learning models from different frameworks, such as TensorFlow, PyTorch, MATLAB, Caffe, and Keras, to a single format.
The workflow consists of the following steps:
- Convert the regular PyTorch model file to the ONNX format. The ONNX conversion script is available here (a minimal export sketch is shown after this list).
- Create a TensorRT engine using the trtexec utility:
trtexec --onnx=./model.onnx --saveEngine=./model_fp32.engine --workspace=200
- Run inference from the TensorRT engine.
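For step 1, the conversion follows the standard torch.onnx.export pattern. The snippet below is only an illustration with a stand-in network rather than the actual ExpansionNet v2 export script; the input size matches the 384 × 384 images used in our pipeline.
import torch
import torch.nn as nn

# Stand-in network for illustration only; in the real pipeline this is the trained
# ExpansionNet v2 model loaded from its checkpoint.
class DummyCaptioner(nn.Module):
    def forward(self, x):
        return x.mean(dim=(2, 3))  # placeholder output

model = DummyCaptioner().eval()
dummy_input = torch.randn(1, 3, 384, 384)  # one 384x384 RGB image

torch.onnx.export(
    model,
    dummy_input,
    "model.onnx",               # output file consumed by trtexec in the next step
    input_names=["image"],
    output_names=["output"],
    opset_version=13,
)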
TensorRT is a high-performance deep learning inference engine developed by NVIDIA. It optimizes neural network models and generates highly optimized inference engines that can run on NVIDIA GPUs. TensorRT uses a combination of static and dynamic optimizations to achieve high performance, including layer fusion, kernel auto-tuning, and precision calibration.
PyTorch, on the other hand, is a popular deep learning framework that is widely used for research and development. PyTorch provides a dynamic computational graph that allows users to define and modify their models on the fly, which makes it easy to experiment with different architectures and training methods.
The TensorRT model provides faster inference than the PyTorch model: it takes around 50% less time to process the images, while also having a smaller file size.
In a nutshell, if speed and efficiency are your primary concerns, then TensorRT may be the better choice; it is fast enough for most real-time applications.
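For reference, per-image latency can be measured with a simple helper like the one below. This is a generic sketch for a PyTorch model on the GPU, not our exact benchmarking script; the TensorRT engine can be timed the same way around its execution call.
import time
import torch

def measure_latency_ms(model, input_tensor, warmup=5, runs=50):
    """Average per-image inference time (in milliseconds) for a PyTorch model on the GPU."""
    model.eval()
    with torch.no_grad():
        for _ in range(warmup):            # warm-up iterations to stabilize GPU clocks and caches
            model(input_tensor)
        torch.cuda.synchronize()            # wait for all queued GPU work before starting the timer
        start = time.perf_counter()
        for _ in range(runs):
            model(input_tensor)
        torch.cuda.synchronize()            # wait for the timed runs to finish
    return (time.perf_counter() - start) / runs * 1000.0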
During the inference process, you can check the current performance of Nvidia Jetson boards using the jetson-stats utility. You can monitor the resources your models are using in real time and get maximum utilization out of your hardware.
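As a quick illustration (assuming jetson-stats is installed via pip), the same statistics shown by the jtop terminal tool can also be read from Python:
from jtop import jtop  # Python API of the jetson-stats package

with jtop() as jetson:
    if jetson.ok():
        stats = jetson.stats               # dictionary of CPU, GPU, RAM, temperature, and power readings
        print("GPU:", stats.get("GPU"))
        print("RAM:", stats.get("RAM"))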
A real-world experiment with a human subject wearing the image captioning assistive device
This figure illustrates the real-world experiment of our image captioning assistive system, which comprised a camera, a single-board deep learning computer (Nvidia Jetson Xavier NX), a push button, and headphones.
The camera was connected to the single-board computer through a universal serial bus (USB), while the push button and headphones were connected to the general-purpose input/output (GPIO) pins and audio port of the single-board computer, respectively. The Intel RealSense camera was secured to the user's forehead using adjustable straps, while the user carried the single-board computer (and a power bank) in a backpack and wore the headphones during operation.
Combining Image Captioning, Object Detection, Depth Perception, and Spatial Audio for Improved Accessibility
Image captions can describe the surrounding environment, which is beneficial for visually impaired and blind individuals. However, they do not provide information about the exact location of an object in the three-dimensional world. To address this limitation, our proposed solution integrates image-to-text technology, object detection, depth perception through the use of an Intel RealSense camera, and spatial audio techniques.
In this context, we will utilize the Intel RealSense Depth Camera D455's RGB (Red Green Blue) imaging for object detection, as well as its stereo vision capabilities to compute depth. The Intel RealSense SDK offers a range of post-processing filters (Filter Magnitude, Smooth Alpha, Smooth Delta, and Hole Filling) that can drastically improve depth quality. A filter can correct outliers by using surrounding pixel values in places where depth data was not obtained. We can fill in some of the gaps in the image by applying a Hole-Filling post-processing filter using the following commands in Python:
import pyrealsense2 as rs  # Intel RealSense SDK Python bindings

hole_filling = rs.hole_filling_filter(2)  # mode 2: fill holes from the nearest valid neighboring pixel
depth_frame = hole_filling.process(depth_frame)  # apply the filter to a captured depth frame
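For context, a minimal capture loop that applies the hole-filling filter and then reads the distance at the image center might look like the sketch below; the stream settings are illustrative and not necessarily the ones used on the device.
import pyrealsense2 as rs

pipeline = rs.pipeline()
config = rs.config()
config.enable_stream(rs.stream.depth, 640, 480, rs.format.z16, 30)   # depth stream
pipeline.start(config)

hole_filling = rs.hole_filling_filter(2)   # fill holes from the nearest valid neighboring pixel
try:
    frames = pipeline.wait_for_frames()
    depth_frame = hole_filling.process(frames.get_depth_frame()).as_depth_frame()
    distance_m = depth_frame.get_distance(320, 240)   # depth at the image center, in meters
    print(f"Distance to the object at the image center: {distance_m:.2f} m")
finally:
    pipeline.stop()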
For object detection, we utilized You Only Look Once (YOLO) models, specifically evaluating the performance of the YOLOv5, YOLOv6, and YOLOv7 object detectors. All of these algorithms demonstrated strong inference capabilities. We then converted the regular PyTorch-based YOLO models into Nvidia's TensorRT format and quantized them by reducing the model precision from 32-bit floating point (FP32) to 16-bit (FP16). Quantized models offer various benefits, such as a reduced memory footprint and improved computational efficiency.
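As an illustration, the FP16 engine can be built with the same trtexec utility shown earlier by adding the --fp16 flag; the file names below are placeholders rather than our actual model files.
trtexec --onnx=./yolo_model.onnx --saveEngine=./yolo_model_fp16.engine --fp16 --workspace=200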
Here are the inference results:
Spatial audio technology simulates how humans naturally perceive sound in the three-dimensional world. To achieve this, the assistive device generates audio based on two factors:
- Distance from an object: This information is derived from the depth sensor of the Intel RealSense camera, allowing the sound to be positioned accurately in the virtual space relative to the user.
- Gaze of a visually impaired user: Using data from the RGB camera, the system can determine where the user is looking and adjust the audio positioning accordingly.
The visual field is divided into a 3x3 grid. Each cell can produce a beep sound, and whether a sound is generated depends on the distance and the gaze of the blind user relative to the object. The sound is then passed to the headphones, playing through the left earpiece for objects located to the left and through the right earpiece for objects located to the right. If the user hears a sound from both the left and right speakers, it indicates that their gaze is directly on the object. The assistive device has a limited range of 50 cm, which is primarily a limitation of the Intel RealSense camera. A simplified sketch of this left/right mapping is shown below.
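The sketch below is purely illustrative, not the actual device code: it pans a beep to the left ear, the right ear, or both depending on the horizontal grid column, and scales the volume with distance. It assumes the sounddevice package for playback; in the real system the column and distance come from the RGB and depth streams of the RealSense camera.
import numpy as np
import sounddevice as sd   # assumed audio backend for this sketch

SAMPLE_RATE = 44100

def make_beep(column, distance_m, duration=0.15, freq=880.0):
    """Stereo beep for a 3x3 grid cell: column 0 = left, 1 = center, 2 = right.
    Closer objects produce a louder beep."""
    t = np.linspace(0, duration, int(SAMPLE_RATE * duration), endpoint=False)
    tone = np.sin(2 * np.pi * freq * t)
    loudness = np.clip(1.0 - distance_m / 3.0, 0.1, 1.0)   # simple distance-to-volume mapping
    left_gain = 1.0 if column <= 1 else 0.0    # left and center cells feed the left earpiece
    right_gain = 1.0 if column >= 1 else 0.0   # right and center cells feed the right earpiece
    return (np.stack([tone * left_gain, tone * right_gain], axis=1) * loudness).astype(np.float32)

# Example: object detected in the right-hand column, about 1.2 m away
sd.play(make_beep(column=2, distance_m=1.2), SAMPLE_RATE)
sd.wait()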
We conducted four experiments with blindfolded subjects:
- Without assistive device
- With assistive device using horizontal feedback
- With assistive device using horizontal and distance feedback
- With assistive device using horizontal, vertical, and distance feedback
Blindfolded participants carried out 30 trials in a single session, with each trial lasting no longer than 2 minutes. In each trial, they were instructed to find an object (a teddy bear) placed on the wall of a safe room with normal illumination and no obstacles. Participants were asked to rely solely on the audio feedback provided by the systems and not to use their hands to feel where the walls were. After the experiment, we administered the System Usability Scale and the NASA Task Load Index to assess the users' experience.
Here are the average results for the four experiments with blindfolded subjects:
- Experiment 1 (without assistive device): Total Average = 36.68977874 seconds
- Experiment 2 (with assistive device using horizontal feedback): Total Average = 29.51056874 seconds
- Experiment 3 (with assistive device using horizontal and distance feedback): Total Average = 27.98622182 seconds
- Experiment 4 (with assistive device using horizontal, vertical, and distance feedback): Total Average = 27.76512912 seconds
The study demonstrated significant improvements in participant performance when using the assistive device compared to the baseline condition. This suggests that the proposed solution has the potential to enhance the mobility and independence of visually impaired individuals.
Conclusion and further improvements
Visually impaired and blind individuals face unique challenges in their daily lives, including the inability to independently access visual information. Image captioning technology has shown promise in providing assistance to this community.
In addition to the existing image captioning and text-to-speech technologies, we aim to incorporate Visual Question Answering (VQA) functionality into our assistive device for the visually impaired and blind. This will enable users to ask questions about the images and receive spoken answers.
To further optimize our deep learning model and improve its performance, we will perform quantization from FP32 to FP16 or INT8. This will reduce the memory footprint and computation time required for inference, making our assistive device more efficient.
If you are interested in our project, please consider adding a star to our repository on GitHub. Thanks a lot!
I hope you found this research study useful and thanks for reading it. If you have any questions or feedback, leave a comment below. Stay tuned!
Acknowledgements
- This project was made possible through the support, guidance, and assistance of the staff of the Institute of Smart Systems and Artificial Intelligence.
- The implementation of the image captioning model relies on ExpansionNet v2.