Addressing the problem of overworked medical professionals and manpower shortages in Taiwan, we have developed a system that utilizes cloud-based GPU resources to streamline the symptom evaluation process, provide medical assistance, and optimize the use of limited medical resources.
In Taiwan, the distribution of medical resources is notably uneven. According to statistics from the Ministry of Health and Welfare, most doctors are concentrated in the western regions, leaving indigenous, island, and highly remote areas with inadequate medical coverage. Alarmingly, 87.5% of townships in these regions fail to meet the World Health Organization (WHO) standard for doctor-to-population ratios, with some areas lacking even a single physician.
Scalability Across Multiple Levels
Our system is designed to serve users ranging from government agencies to local hospitals, small clinics, and personalized diagnostic services. This scalability ensures effective integration at all levels of the healthcare system, enhancing resource distribution and access.
Healthcare Challenges
Distance to Medical Resources: Many residents must travel over an hour to reach the nearest hospital, posing severe challenges in emergencies and increasing patient suffering due to prolonged travel times.
Insufficient Medical Resources: Scarcity of medical supplies, equipment, and personnel, combined with chronic funding shortages, hampers diagnosis and treatment.
Lack of Medical Personnel: There is a critical shortage of medical personnel, including doctors, nurses, pharmacists, and technicians, leading to incomplete care or delays.
Cultural and Social Differences: Unique cultural and social practices, including language barriers, hinder effective communication and interaction between healthcare providers and patients, affecting diagnosis and treatment quality.
Relief for Taiwan's Healthcare System
By implementing our system, the burden and costs on Taiwan's National Health Insurance (NHI) system can be significantly reduced. Our solution streamlines symptom evaluation and department classification, allowing the NHI system to operate more efficiently and allocate resources more effectively.
Given these challenges, our GenAI-based Real-time Symptom Evaluation System is designed to address the severe disparities in healthcare resource distribution in Taiwan. By leveraging advanced AI and cloud-based GPU resources, specifically the AMD Accelerator Cloud - MI210, our system aims to provide immediate, accessible, and efficient medical guidance to underserved regions.
This block diagram illustrates our system architecture for deploying and serving the language model on AMD GPU clusters. ADAR-MED is composed of three components:
Gradio Interface
We built a user-friendly web interface with Gradio [1], allowing users to interact with our language model. The interface offers intelligent prompt suggestions, symptom diagnosis, tailored health advice, and medical department recommendations. Users can choose between chat mode for natural conversation and urgent mode for emergency assistance.
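As an illustration, a minimal Gradio sketch of such a two-mode chat interface might look like the following. This is a hedged sketch, not the actual ADAR-MED code: the query_backend stub and the exact mode labels are assumptions.

import gradio as gr

# Stand-in for the call that ADAR-MED forwards to the vLLM worker.
def query_backend(prompt):
    return "model reply for: " + prompt

def respond(message, history, mode):
    # Urgent mode trades conversational context for a fast, keyword-driven reply.
    prefix = "[URGENT] " if mode == "Urgent Mode" else ""
    return query_backend(prefix + message)

demo = gr.ChatInterface(
    fn=respond,
    additional_inputs=[gr.Radio(["Chat Mode", "Urgent Mode"],
                                value="Chat Mode", label="Mode")],
    title="ADAR-MED Medical ChatBot",
)
demo.launch(share=True)  # share=True opens a public URL, as in Section 11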
vLLM Engine
vLLM [2] is a high-performance engine for fast LLM inference and serving. By employing PagedAttention and continuous batching, vLLM excels at rapidly processing LLM requests.
PagedAttention divides the Key-Value (KV) cache into manageable segments, significantly reducing memory consumption. Since memory is the primary performance bottleneck in LLM serving, this optimization lets the GPU manage memory more efficiently, enabling LLM systems to handle larger input sequences and more complex tasks. Continuous batching further enhances efficiency by dynamically grouping incoming requests, minimizing latency and maximizing throughput.
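For reference, a minimal offline-inference sketch with vLLM looks like this; the model id and sampling settings are placeholders, and PagedAttention plus continuous batching are applied by the engine automatically.

from vllm import LLM, SamplingParams

# PagedAttention and continuous batching are built into the engine; incoming
# prompts are packed into batches dynamically instead of padded statically.
llm = LLM(model="medalpaca/medalpaca-7b", max_num_seqs=768)

params = SamplingParams(temperature=0.7, max_tokens=256)
prompts = [
    "I have had a persistent headache and blurred vision for two days.",
    "Which department should I visit for chest pain?",
]
for out in llm.generate(prompts, params):
    print(out.outputs[0].text)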
To unlock the power of AMD's GPU architecture, vLLM is integrated into the backend of the ADAR-MED system. In tandem with vLLM, the system achieves exceptional token processing speed, peaking at 6039.68 tok/s for summarization and 2207.36 tok/s for generation when handling 64 concurrent requests.
The throughput of a server system is correlated with the number of user requests it handles. However, beyond a certain point, adding more concurrent requests adversely affects latency. This point matters to us, as our system needs to serve as many users as possible. By leveraging the synergy between AMD GPUs and vLLM, ADAR-MED can simultaneously serve 768 online users on a single AMD Instinct™ MI210 Accelerator while keeping per-user response speed in line with average human reading speed.
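A back-of-the-envelope check of this claim, using the measurements from Section 9-2 (the words-per-token ratio is an assumption):

# Per-user generation rate at the measured 768-user operating point.
generation_throughput = 2219.52          # tokens/s (Section 9-2)
concurrent_users = 768

per_user_rate = generation_throughput / concurrent_users
words_per_second = per_user_rate * 0.75  # assuming ~0.75 words per token
print(f"{per_user_rate:.2f} tokens/s ~= {words_per_second * 60:.0f} words/min per user")
# ~2.89 tokens/s, roughly 130 words/min, on the order of human reading speed.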
Controller
The ADAR-MED chatbot system employs a centralized controller to orchestrate distributed workers. Building upon the innovative architecture from the Fastchat [3] community, this controller manages communication between clients and workers, assigning tasks and distributing workloads efficiently. Each model worker is dedicated to hosting a single LLM model on a GPU, ensuring optimal resource utilization. By coordinating web servers and model workers, the controller optimizes system performance and scalability.
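The dispatch role can be pictured with the simplified sketch below. This is an illustration of the idea only, not FastChat's actual implementation, and all names are hypothetical.

# Workers register with the controller; the controller routes each request
# to one of the workers hosting the requested model.
class Controller:
    def __init__(self):
        self.workers = {}  # model name -> list of worker addresses

    def register_worker(self, model_name, address):
        self.workers.setdefault(model_name, []).append(address)

    def dispatch(self, model_name, request_id):
        pool = self.workers[model_name]
        return pool[request_id % len(pool)]  # round-robin load balancing

controller = Controller()
controller.register_worker("med_alpaca", "http://127.0.0.1:21002")
print(controller.dispatch("med_alpaca", request_id=0))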
5. Web UI
GUI Design
This graphical user interface (GUI) is designed for a Medical ChatBot application that assists users with inquiries related to symptoms, suggestions, and specialist recommendations. The interface is clean and user-friendly, featuring a structured layout with intuitive controls and options for user interaction.
The Chatbot Button
We have two modes:
- Chat Mode: Used for everyday situations, allowing you to seek answers through conversation.
- Urgent Mode: Suitable for emergencies, where you can enter keywords to quickly generate a response.
Urgent Mode in particular can provide real-time assistance to patients and medical personnel in critical situations, such as during busy periods at medical facilities.
6. Flow Chart
This flow chart outlines the process of symptom evaluation and department recommendation, sketched in code after the steps below.
Step 1: Describe in detail the symptoms you are currently experiencing.
Step 2: The system will provide recommendations based on your symptoms.
Step 3: The system will suggest which medical department you should visit.
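In code, the three steps can be sketched as follows; the helper bodies are toy stand-ins for prompts sent to MedAlpaca-7B, not the system's real logic.

def summarize_symptoms(text: str) -> str:
    return text.strip()  # Step 1: condense the user's description

def generate_recommendations(summary: str) -> str:
    return f"Based on '{summary}': rest, hydrate, and monitor your symptoms."

def classify_department(summary: str) -> str:
    # Toy rule for illustration; the real system prompts the LLM.
    return "Neurology" if "headache" in summary.lower() else "Family Medicine"

def evaluate(symptom_description: str) -> dict:
    summary = summarize_symptoms(symptom_description)
    return {
        "advice": generate_recommendations(summary),   # Step 2
        "department": classify_department(summary),    # Step 3
    }

print(evaluate("I have had a severe headache since yesterday."))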
MedAlpaca-7B
MedAlpaca-7B [4] is a large language model fine-tuned specifically for medical-domain tasks. It is based on LLaMA and contains 7 billion parameters. The primary goal of this model is to improve question-answering and medical dialogue tasks.
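The model can be loaded with the Hugging Face transformers library, for example as sketched below; the plain question-answer prompt template is an assumption, not the model's official format.

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("medalpaca/medalpaca-7b")
model = AutoModelForCausalLM.from_pretrained("medalpaca/medalpaca-7b")

prompt = "Question: What are common causes of chest pain?\nAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output[0], skip_special_tokens=True))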
The following figure shows the architecture of the network:
Basic Architecture of MedAlpaca
Transformer Structure
- Encoder and Decoder are the two main components of the Transformer model. The encoder converts the input sequence into high-dimensional representations, while the decoder transforms these representations into the output sequence.
- MedAlpaca is based on LLaMA (Large Language Model Meta AI), which is a decoder-only architecture.
Self-Attention Mechanism
- Self-Attention is one of the core mechanisms of the Transformer model. It allows the model to consider the influence of other words in the sequence while processing each word, capturing relationships and contextual information between words.
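A minimal NumPy sketch of single-head scaled dot-product self-attention (omitting the causal mask that a decoder-only model like LLaMA adds so each token attends only to earlier tokens):

import numpy as np

def self_attention(x, w_q, w_k, w_v):
    # Project the input into queries, keys, and values.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])         # pairwise token affinities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ v                              # context-mixed representations

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
x = rng.normal(size=(seq_len, d_model))
w_q, w_k, w_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)       # (4, 8)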
Medical Meadow [5] is a collection of medical tasks designed for fine-tuning and evaluating large language models in medicine. It includes two main categories: established medical NLP tasks reformatted for instruction tuning, and data crawled from various internet resources. It comprises the following datasets (a sample record in the instruction format is sketched after the list):
- Flash Cards Used by Medical Students: Question-answer pairs from Anki Medical Curriculum flashcards, covering a broad range of medical topics, restructured using GPT-3.5-Turbo.
- Stack Exchange Medical Sciences: 52,475 question-answer pairs from Stack Exchange forums on topics like Academia, Bioinformatics, Biology, Fitness, and Health, drawn from highly rated answers.
- Wikidoc: Medical question-answer pairs from WikiDoc, including data from the "Living Textbook" and "Patient Information" sub-sites.
- Medical NLP Benchmarks: Data from various sources including CORD-19, Measuring Massive Multitask Language Understanding, MedQA benchmark, Pubmed Causal Benchmark, medical forums, and the OpenAssistant dataset.
- ChatDoctor [6]: 200,000 question-answer pairs.
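For illustration, a single record in the Alpaca-style instruction format used for this kind of instruction tuning might look like the following (the field values are invented):

record = {
    "instruction": "Answer this medical question truthfully.",
    "input": "What is the first-line treatment for mild dehydration?",
    "output": "Oral rehydration with fluid and electrolyte solutions.",
}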
9. Evaluation
This section presents the evaluation results of our model on the Open LLM Leaderboard, followed by a detailed analysis of the server system's throughput performance. Together, these findings offer insight into both the model's capabilities and the system's capacity to handle concurrent users.
9-1. Open LLM Leaderboard Evaluation Results
The following plot displays the performance metrics of MedAlpaca-7B evaluated on the Open LLM Leaderboard.
9-2. Throughput Analysis of the Server System
The following data was collected to measure the performance of the summarization and generation stages in terms of tokens processed per second (T/s) across multiple online users.
Summarization Stage
The summarization stage involves condensing a long piece of text into a shorter version while preserving the key information.
Generation Stage
The generation stage involves creating new text from a given prompt or input, where the model predicts and generates tokens one after another.
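As a rough sketch, generation-stage throughput can be measured as follows; the model id and request count are placeholders, and the figures reported in this section come from the authors' own measurements, not this script.

import time
from vllm import LLM, SamplingParams

llm = LLM(model="medalpaca/medalpaca-7b")
params = SamplingParams(max_tokens=128)
prompts = ["Describe common flu symptoms."] * 64   # 64 concurrent requests

start = time.perf_counter()
outputs = llm.generate(prompts, params)
elapsed = time.perf_counter() - start

generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated / elapsed:.2f} tokens/s across {len(prompts)} requests")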
Analysis
The performance measurements for the summarization and generation stages show that the number of tokens processed per second (T/s) varies with the number of online users, demonstrating the GPU's computational capability. Notably, the highest observed token processing rates are 6039.68 T/s for the summarization stage and 2207.36 T/s for the generation stage when the system serves 64 online users in parallel.
These peak values highlight the GPU's ability to handle significant computational loads, confirming its suitability for real-time symptom evaluation in medical applications. Specifically, the high token throughput indicates that the AMD Instinct™ MI210 Accelerators can efficiently manage the intensive computations required for both summarizing patient data and generating appropriate medical recommendations.
Upon further analysis with additional data, we observed even more remarkable performance metrics:
From this expanded measurement, the highest token processing rates reached 2810.88 T/s for the summarization stage and 2219.52 T/s for the generation stage when the system serves 768 online users in parallel. These results underscore the GPU's ability to sustain extremely high computational loads, far exceeding the initial observations.
Key Observations
- Peak Performance: The GPU achieved peak performance of 2810.88 T/s during the summarization stage and 2219.52 T/s during the generation stage, showcasing its powerful processing capabilities.
- Consistent High Throughput: Multiple instances of high throughput (above 2000 T/s) were recorded, particularly in the generation stage, confirming the GPU's robust and reliable performance across various batch sizes.
- Variable Load Handling: The variation in token processing rates, with some batches showing lower throughput, indicates the GPU's ability to handle fluctuating workloads efficiently.
The system's high performance ensures that it can be effectively deployed in both small clinics and large-scale government operations:
- Small Clinics: The high token processing rates mean that small clinics can utilize this system to provide quick and accurate symptom evaluations, enhancing patient care without overwhelming the clinic's resources.
- Government Agencies: The system's ability to handle extremely high computational loads makes it ideal for large-scale deployment by government agencies. It can support massive volumes of data processing and symptom evaluations, ensuring that public health initiatives and emergency responses are well-supported.
Thus, our system is versatile and scalable, capable of meeting the demands of various healthcare settings, from individual clinics to large government entities, thereby improving overall healthcare accessibility and efficiency.
11. How To Run The Project
The following tutorial explains how to run the system on the AMD Accelerator Cloud.
11-1. Setup environment (Install package)
(a) Install Anaconda and create an environment with Python version >= 3.11.
$ conda create -n your_env python=3.11
$ conda activate your_env
(b) Install vLLM. Refer to the vLLM documentation [2] for more details.
- Install PyTorch for ROCm (version >= 6.0).
$ pip3 install torch --index-url https://download.pytorch.org/whl/rocm6.0
- Install Triton flash attention for ROCm.
$ git clone https://github.com/ROCmSoftwarePlatform/triton.git
$ cd triton
$ git checkout triton-mlir
$ cd python
$ pip3 install ninja cmake
$ pip3 install -e .
$ python setup.py install
$ cd ../..
- Download and build vLLM.
$ git clone https://github.com/vllm-project/vllm.git
$ cd vllm
$ pip install -U -r requirements-rocm.txt
$ python setup.py install
$ cd ..
(c) Install ADAR-MED, which is built on FastChat v0.2.36.
$ git clone git@github.com:kai-0430/ADAR-MED.git
$ cd ADAR-MED
$ pip3 install --upgrade pip
$ pip3 install -e ".[model_worker,webui]"
11-2. Execute
We recorded a demo video of running ADAR-MED.
A total of three terminal windows will be used. On the AMD Accelerator Cloud, we use the host IP 127.0.0.1.
Terminal 1: activate the controller
$ python3 -m fastchat.serve.controller --host 127.0.0.1
Parameters:
- --host: specify the host ip
- --port: specify the port
Terminal 2: activate the vLLM engine
$ python3 -m fastchat.serve.vllm_worker --host 127.0.0.1 --model-path /path/to/med_alpaca --max-num-seqs 768
Parameters:
- --host: specify the host ip
- --model-path : the location of the language model
- --max-num-seqs : the maximum number of simultaneous requests to the model
Terminal 3: activate the ADAR-MED web user interface
$ python3 -m fastchat.serve.med_chabot_web_server --host 127.0.0.1 --share
Parameters:
- --host: specify the host ip
- --share : whether to open a public URL
You will receive a public URL for accessing ADAR-MED online. Simply copy and paste the URL into your browser, and you can start using ADAR-MED.
12. Conclusion
Improved Access to Medical Guidance: Our system offers real-time evaluation of symptoms, allowing residents to receive immediate recommendations on appropriate medical departments without the need for long-distance travel.
Resource Optimization: By accurately classifying symptoms and directing patients to the appropriate medical departments, our system helps minimize unnecessary medical visits, optimizing the use of limited medical resources.
Enhanced Support for Medical Personnel: The system supports healthcare professionals by easing the decision-making process, particularly in understaffed and high-pressure environments, thereby improving the overall quality of care.
Breaking Barriers for All: Our chatbot system is crafted to be simple and user-friendly, reducing usage barriers for the elderly or those not adept with electronic devices. This ensures that more people can benefit from timely medical advice.
13. References
[1] Package: Gradio GitHub
[2] Package: vLLM GitHub
[3] Package: FastChat GitHub
[4] Model: MedAlpaca-7B
[5] Dataset: Medical Meadow
[6] Dataset: ChatDoctor
[7] MI-210 Document: AMD Accelerator Cloud Guides
[8] MI-210 User Guide: Study Guide