Paging Doctor AI

TinyLLaVA-Med is a compact AI model for medical diagnostics, optimized to run on edge platforms like the NVIDIA Jetson to make it more widely accessible.

The approach to building TinyLLaVA-Med (📷: A. Mir et al.)

Before they can make an accurate diagnosis, physicians often need to pull in data from a variety of sources, including medical images, written reports, laboratory tests, and measurements captured by sensors. This process is complex, and that complexity sometimes leads to medical conditions being misdiagnosed, which in turn leads to poor outcomes for patients. But help may be on the way — artificial intelligence (AI) algorithms have shown tremendous potential in their ability to characterize complex, multimodal data of this sort.

In fact, a number of models have already been developed to assist medical professionals in making diagnoses. Models like LLaVA-Med, Med-PaLM, and BiomedCLIP integrate large language models with vision encoders to help with complex reasoning tasks involving both text and medical images. Models such as these have already proven themselves valuable in the decision-making process; however, they generally require a very large amount of computational resources to operate. This makes them impractical for use in many situations, especially in underserved and other resource-limited environments.

The TinyLLaVA architecture (📷: A. Mir et al.)

Recent work carried out by a team at New York University Abu Dhabi may make these AI systems more accessible in the future. The researchers have developed what they call TinyLLaVA-Med, which is a pint-sized multimodal large language model (LLM) targeted at medical diagnostics. Despite being small in size and capable of running on modest computing platforms, TinyLLaVA-Med punches well above its weight in terms of accuracy and performance.

The team started with the general-purpose TinyLLaVA LLM, then adapted it for use in medical applications through a sequential approach that included instruction-tuning and fine-tuning on specialized datasets.

During the instruction-tuning phase, the TinyLLaVA model was fine-tuned to better handle medical dialogues and follow diverse instructions. The model's projection layer and language model weights were updated, while the visual encoder weights remained frozen. The team used the biomedical instruction-tuning data from LLaVA-Med, which consists of 60,000 image-text pairs sourced from the PMC-15M dataset. That data was further enriched by incorporating sentences from PubMed articles and using GPT-4 to generate multi-round conversational data. This phase significantly improved the model's ability to interact effectively in medical contexts, transforming it into TinyLLaVA-Med.
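The core idea of this stage, training the projection layer and language model while keeping the vision encoder frozen, can be sketched in a few lines of PyTorch. This is a minimal sketch, not the team's actual code: the attribute names vision_tower, mm_projector, and language_model, as well as the learning rate, are illustrative assumptions.

```python
import torch

def prepare_for_instruction_tuning(model):
    """Freeze the visual encoder; train only the projector and LM weights.

    `vision_tower`, `mm_projector`, and `language_model` are hypothetical
    attribute names standing in for the corresponding multimodal modules.
    """
    # Keep the pretrained vision encoder fixed during instruction-tuning.
    for p in model.vision_tower.parameters():
        p.requires_grad = False

    # Update the projection layer that maps image features into the LM space.
    for p in model.mm_projector.parameters():
        p.requires_grad = True

    # Update the language model weights as well.
    for p in model.language_model.parameters():
        p.requires_grad = True

    trainable = [p for p in model.parameters() if p.requires_grad]
    return torch.optim.AdamW(trainable, lr=2e-5)  # learning rate is illustrative
```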

An experimental deployment of the system (📷: A. Mir et al.)

Following the instruction-tuning, the model underwent fine-tuning on specialized biomedical visual question answering datasets such as VQA-RAD and SLAKE. These datasets contain both open-ended and closed-ended medical questions, and they served as benchmarks to evaluate and further improve the model's performance. This fine-tuning step ensured that TinyLLaVA-Med achieved high accuracy and became specialized in medical visual question answering tasks.
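For the closed-ended questions, evaluation essentially comes down to checking whether the model's generated answer matches the reference answer (for example, "yes" versus "no"). The snippet below is a simple scoring sketch under that assumption; it does not reproduce the exact evaluation protocol used by the researchers.

```python
def closed_ended_accuracy(predictions, references):
    """Exact-match accuracy for closed-ended VQA answer pairs.

    `predictions` and `references` are parallel lists of answer strings,
    e.g. drawn from VQA-RAD or SLAKE. Normalization here is deliberately simple.
    """
    assert len(predictions) == len(references)
    correct = 0
    for pred, ref in zip(predictions, references):
        if pred.strip().lower() == ref.strip().lower():
            correct += 1
    return correct / len(references)

# Example: three closed-ended answers, two of which match the references.
print(closed_ended_accuracy(["Yes", "no", "yes"], ["yes", "no", "no"]))  # ~0.667
```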

With a useful model in hand, the team had to demonstrate that it was practical for deployment in resource-constrained environments. Toward that goal, they deployed TinyLLaVA-Med on an NVIDIA Jetson Xavier single-board computer. While drawing only 18.9 watts of power and consuming 11.9 GB of memory, the model achieved an accuracy of 64.5 percent on VQA-RAD and 70.7 percent on SLAKE for closed-ended questions. These accuracy levels are close to today's state-of-the-art models, showing that not only is TinyLLaVA-Med capable of running on a shoestring budget, but it can also provide physicians with meaningful help in their work.
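Peak memory use of this kind can be tracked during inference with PyTorch's built-in CUDA statistics (power on a Jetson is typically read separately, for instance with NVIDIA's tegrastats utility). The following is a minimal memory-tracking sketch, assuming a Hugging Face-style generate() interface; it is not the team's benchmarking harness, and it only accounts for tensors allocated by PyTorch rather than whole-system memory.

```python
import torch

def generate_with_memory_stats(model, inputs, **gen_kwargs):
    """Run one generation pass and report peak GPU memory in GB.

    `model` is assumed to expose a generate() method and `inputs` is assumed
    to be a dict of input tensors already placed on the GPU.
    """
    torch.cuda.reset_peak_memory_stats()
    with torch.inference_mode():
        output = model.generate(**inputs, **gen_kwargs)
    peak_gb = torch.cuda.max_memory_allocated() / 1024**3
    print(f"Peak GPU memory during generation: {peak_gb:.1f} GB")
    return output
```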
