A.MD (A Doctor of Medicine) is a patient summary application that efficiently handles 1.59 million synthetic unique patient records. The application aims to provide summaries of recent visits, pattern analysis, predictions, and care plan suggestions to doctors for informed decision making. Additionally, it has a healthcare chat interface where healthcare providers can query anything in the healthcare domain. Leveraging the AMD Radeon PRO W7900 GPU with ROCm, the application supports locally served large language models, ensuring robust performance while handling huge amounts of data and protecting patient privacy.
Introduction
The main motivation behind the development of the patient summary application is to show how easy it is to integrate EHRs with LLMs to create precise summaries for the healthcare provider.
This application handles vast amounts of healthcare data with ease and simplicity by utilizing Elasticsearch and retrieval-augmented generation (RAG). With these techniques, the application can efficiently run inference with the latest Llama 3.1 models (8B and 70B parameters). The AMD Radeon PRO W7900 GPU with 48 GB of memory is the heart of the application, and ROCm enhances the ability to load larger models, enabling faster inference and accurate summaries.
Why AMD GPUs for a Healthcare Application?
Because they are really pretty to look at.
1. Efficiency and performance: The W7900 with 48 GB of memory offers exceptional performance, delivering high throughput and low latency, ideal for real-time data processing like patient summaries.
2. ROCm software stack: ROCm (Radeon Open Compute) 6.1.3 enhances the performance and capabilities of AMD GPUs for AI, which significantly improves the ability to run LLMs efficiently.
3. Cost effectiveness: The substantial memory capacity allows handling extensive healthcare data without the need for expensive cloud resources.
4. Data Privacy: Running AI models locally ensures sensitive patient data remains within the healthcare facility. It complies with healthcare regulations and protects patient information.
5. Industry-grade: Radeon PRO W7900 GPUs are designed for professional and industrial use. Built on the advanced RDNA 3 architecture, they provide more than 2x the AI performance per compute unit, making them ideal for demanding applications.
Follow the ROCm installation documentation (it covers ROCm version 6.1.2); the steps below are for version 6.1.3.
Update and Install Required Packages
sudo apt update
sudo apt install "linux-headers-$(uname -r)" "linux-modules-extra-$(uname -r)"
sudo usermod -a -G render,video $LOGNAME
Install AMD GPU Package
wget https://repo.radeon.com/amdgpu-install/6.1.3/ubuntu/jammy/amdgpu-install_6.1.60103-1_all.deb
chmod 777 ./amdgpu-install_6.1.60103-1_all.deb
sudo apt install ./amdgpu-install_6.1.60103-1_all.deb
Install ROCm
sudo apt update
sudo apt install amdgpu-dkms rocm
Configure ROCm Use Case
sudo amdgpu-install --list-usecase
sudo amdgpu-install --usecase=rocm
Test with rocm-smi
Torchtune offers support for ROCm and ran locally without any errors.
Clone the Torchtune Repo
git clone https://github.com/pytorch/torchtune.git
cd torchtune
Setup virtual env
python3 -m venv venv
source venv/bin/activate
Install PyTorch with ROCm
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1/
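Once the ROCm build of PyTorch is installed, a quick sanity check confirms the GPU is visible. This is a minimal sketch; the exact device string and HIP version depend on your setup:
import torch

# ROCm builds of PyTorch expose the GPU through the familiar CUDA API
print(torch.cuda.is_available())        # expect True
print(torch.version.hip)                # HIP/ROCm version the wheel was built against
print(torch.cuda.get_device_name(0))    # should report the Radeon PRO W7900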
Install WandB
pip3 install wandb
wandb login
Build and Install Torchtune
pip3 install build
python3 -m build
pip3 install dist/*.whl
Run Finetuning
tune run lora_finetune_single_device --config llama3/8B_lora_single_device \
metric_logger.component=torchtune.utils.metric_logging.WandBLogger \
metric_logger.project=llama-tune \
profiler.enabled=True
The project has a user-friendly UI but a complex backend. Here is a brief overview of the architecture before delving into the details.
Preprocessing the Data
- Synthetic FHIR files total 161.2 GB of patient data, covering 1.59 million unique patients. Various preprocessing steps were performed to get the data into the desired format.
React UI
- Provides an interface for doctors to interact with the application. Enables searching for patients, viewing their latest visit summaries, summaries for various categories, and chatting with the chat agent.
WebSockets Stream
- Facilitates two-way communication between the UI and the backend.
Backend
- Built on Python using FastAPI. Handles API requests from the UI and manages WebSocket interactions for real-time connections.
Retriever
- Quickly retrieves patient information, including recent visits and specific category history. Stores and indexes 1.59 million patient records in Elasticsearch for efficient data retrieval, summarization, and analysis.
Augment
- Uses prompt templates as boilerplate for different queries. Augments them with the retrieved data and sends the whole query to the LLM for summary generation.
Docker Containers
- Ollama (ROCm version): Serves local LLMs via an API.
- Elasticsearch: Contains all patient information, including history, enabling fast and efficient searches.
Weights and Biases
- Logs GPU metrics from rocm-smi while processing requests. Monitors token generation, performance, and GPU utilization.
The entire project video demo is provided at the end of the document.
Data Gathering
After exploring various resources, none provided the large amount of data I needed for my application. I finally found Synthea, which provides large simulated patient datasets in HL7® FHIR® (Fast Healthcare Interoperability Resources) format.
Preprocessing:
The FHIR files downloaded from Synthea total about 161 GB, covering a whopping 1.59 million patients.
Let's take a close look at what a FHIR file looks like: lots of fields, IDs, and URLs, mostly unnecessary information from our application's standpoint.
Clean data is needed to generate relevant summaries, so the unwanted FHIR fields were removed and the data was saved in CSV format. The entire patient dataset is divided into 12 chunks, and each chunk has 9 different categories. Getting a patient's entire data with all related categories in one place is a very slow process, hence further processing is necessary.
['allergies', 'careplans', 'conditions', 'encounters', 'immunizations', 'medications', 'observations', 'patients', 'procedures']
Previous revisions of preprocessing:
Various revisions were performed before finalizing the current format.
Version 1:
Initially, I started with tables, attempting to render each patient's information and searching other tables for related data.
This approach took ~20-30 seconds to return results, which is not suitable for real-time applications.
Find this work in the version_1 folder in doctorapp.
prompt += "\nMedications:\n"
for medication in patient_data['medications']:
prompt += f"- Code:{medication['CODE']} Description: {medication['DESCRIPTION']} (Start: {medication['START']}, Stop: {medication['STOP']}, Reason Code: {medication['REASONCODE']} Reason: {medication['REASONDESCRIPTION']})\n"
Version 2: Tried a different approach to retrieve patient information, but the results were not accurate.
Since the data was also stored in tables, it was difficult to access records to create embeddings.
Review this work in the version_2 folder in doctorapp.
Final Version
For quick retrieval of patient information, the idea is to group each patient's info by category and keep it under a single indexed ID. Due to the large amount of data, a pool of concurrent workers is used to finish importing the patient data.
The full code is provided on GitHub: https://github.com/divyachandana/doctorapp/blob/main/data_preprocess/export_all_data.py
from concurrent.futures import ThreadPoolExecutor, as_completed

# Submit one transform job per patient ID and collect the results as they complete
with ThreadPoolExecutor(max_workers=20) as executor:
    future_to_patient_id = {executor.submit(transform_data, pid): pid for pid in patient_ids}
    transformed_data = []
    for future in as_completed(future_to_patient_id):
        ..............
It took nearly 21 hours to import all the patient information into Elasticsearch:
Batch processed in 46.81 seconds
Cumulative total time: 76075.25 seconds
Total processing time: 76075.25 seconds
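The actual import logic is in export_all_data.py (linked above). For illustration, here is a minimal sketch of how a batch of transformed documents could be pushed into the patients index with the Python client's bulk helper; the patient_id field name and credentials are assumptions, while the index name comes from the URLs below:
from elasticsearch import Elasticsearch, helpers

es = Elasticsearch("http://localhost:9200", basic_auth=("elastic", "elastic"))

def index_batch(transformed_data):
    # One document per patient, with that patient's records grouped by category
    actions = (
        {"_index": "patients", "_id": doc["patient_id"], "_source": doc}  # hypothetical field name
        for doc in transformed_data
    )
    helpers.bulk(es, actions)  # sends the whole batch in a handful of HTTP requests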
The patients' meta information can be retrieved with the following URL:
http://localhost:9200/patients_meta/_search?pretty
A patient's full information can be retrieved with the following URL:
http://localhost:9200/patients/_search
Backend
FastAPI is used in this application and is served via Uvicorn. Integrating the Ollama LLM server with FastAPI is straightforward, through either API endpoints or WebSockets.
Ollama LLM server
Follow the steps below to install the Ollama server locally using a Docker container that is compatible with ROCm.
docker run -d --device /dev/kfd --device /dev/dri -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama:rocm
docker exec -it ollama ollama pull llama3.1:70b-instruct-q4_0
- --device /dev/kfd: (Kernel Fusion Driver) adds the host GPU device /dev/kfd to the container.
- --device /dev/dri: (Direct Rendering Infrastructure) adds the host GPU device /dev/dri to the container.
- -v ollama:/root/.ollama: mounts the volume ollama at /root/.ollama inside the container.
- -p 11434:11434: maps port 11434 on the host to port 11434 in the container.
You should see the following output when opening the endpoint below in a browser:
http://localhost:11434/
Ollama is running
The next step is to configure Ollama in FastAPI for summary generation.
The Ollama URL below will be used for text generation:
http://localhost:11434/api/generate
"stream": True means the generated tokens are sent to the client immediately, without waiting for the full completion. This is handled in the UI using WebSockets, which maintain a continuous connection and update the UI with each stream of tokens received from the backend.
async def generate_summary(prompt):
    url = "http://localhost:11434/api/generate"
    payload = {
        "model": "llama3.1:70b-instruct-q4_0",
        "prompt": prompt,
        "stream": True,
    }
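The rest of generate_summary streams the response back to the caller. A minimal sketch of how that streaming call might look with httpx (the repo may use a different HTTP client; the chunk fields follow Ollama's newline-delimited JSON streaming format):
import json
import httpx

async def generate_summary(prompt):
    url = "http://localhost:11434/api/generate"
    payload = {"model": "llama3.1:70b-instruct-q4_0", "prompt": prompt, "stream": True}
    async with httpx.AsyncClient(timeout=None) as client:
        async with client.stream("POST", url, json=payload) as response:
            async for line in response.aiter_lines():
                if not line:
                    continue
                chunk = json.loads(line)         # each streamed line is one JSON object
                if chunk.get("done"):
                    break                        # the final chunk carries timing stats
                yield chunk.get("response", "")  # the generated token(s)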
Configuring Ollama in FastAPI for chat
The URL below will be used for chatting with the LLM:
http://localhost:11434/api/chat
async def generate_response(prompt):
    url = "http://localhost:11434/api/chat"
    payload = {
        "model": "llama3.1",
        "messages": [
            {
                "role": "system",
                "content": "You are a helpful medical assistant. Your responses should be accurate, brief, and related to healthcare and medical topics."
            },
            {
                "role": "user",
                "content": f"{prompt}"
            }
        ],
        "stream": True,
    }
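On the consuming side, the only difference from /api/generate is where the text lives in each streamed chunk; a small illustrative helper (chunk layout per Ollama's streaming responses):
def extract_text(chunk: dict, endpoint: str) -> str:
    # /api/chat chunks nest the text under message.content,
    # while /api/generate chunks put it directly in "response"
    if endpoint.endswith("/api/chat"):
        return chunk.get("message", {}).get("content", "")
    return chunk.get("response", "")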
Elasticsearch
Docker setup for Elasticsearch
Run the commands below in the terminal:
export ELASTIC_PASSWORD="elastic"
docker network create elastic-net
docker run -d --name elasticsearch --network elastic-net \
-p 127.0.0.1:9200:9200 \
-e ELASTIC_PASSWORD=$ELASTIC_PASSWORD \
-e "discovery.type=single-node" \
-e "xpack.security.http.ssl.enabled=false" \
-e "xpack.license.self_generated.type=trial" \
docker.elastic.co/elasticsearch/elasticsearch:8.14.3
Access Elasticsearch here in the browser:
http://localhost:9200
username: elastic
password: elastic
Configure Elasticsearch to handle different requests by modifying the queries accordingly. The full code is given here: https://github.com/divyachandana/doctorapp/blob/main/elastic_search.py
import os
from elasticsearch import Elasticsearch

es = Elasticsearch(
    ['http://localhost:9200'],
    basic_auth=('elastic', os.getenv('ELASTIC_PASSWORD'))
)

def query_elasticsearch(patient_id, category):
    return es.search(.....)

def patient_meta(query, size):
    return es.search(.....)
Configure Elasticsearch in FastAPI
Once Elasticsearch is initialized and the queries are written, it's very simple to just import the query functions and pass the patient ID and category names.
from elastic_search import query_elasticsearch, recent_info, patient_meta
While querying Elasticsearch, it is important to cleanse the data by excluding unwanted and privacy-sensitive fields.
This step makes sure we don't send unwanted data to the LLM and thereby avoids irrelevant summaries.
"excludes": [
"patient_info.DEATHDATE", "patient.ID","patient_info.SSN", "patient_info.PASSPORT", "patient_info.RACE",
"allergies.PATIENT", "allergies.ENCOUNTER", "careplans.ID", "careplans.PATIENT", "careplans.ENCOUNTER",
"conditions.PATIENT",...............]
},
"query": {
"term": {"_id": patient_id}
}
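Putting the pieces together, query_elasticsearch might look roughly like the sketch below. The index name comes from the URLs above, while the includes pattern and the trimmed excludes list are assumptions about the field layout:
def query_elasticsearch(patient_id, category):
    # Return basic patient info plus only the requested category,
    # excluding privacy-sensitive fields such as SSN and passport number
    return es.search(
        index="patients",
        query={"term": {"_id": patient_id}},
        source_includes=["patient_info.*", f"{category}.*"],
        source_excludes=["patient_info.SSN", "patient_info.PASSPORT", "patient_info.DEATHDATE"],
    )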
Configure WebSocket connections to handle the stream of data for visual representation.
@app.websocket("/ws_generate_summary")
async def websocket_generate_summary(websocket: WebSocket):
await websocket.accept()
try:
while True:
data = await websocket.receive_text()
..........
async for chunk in generate_summary(prompt):
await websocket.send_text(chunk)
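The chat endpoint the React UI connects to (ws://localhost:8000/ws, see the UI section below) can be wired up the same way. A minimal sketch, assuming generate_response yields text chunks just like generate_summary:
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def websocket_chat(websocket: WebSocket):
    await websocket.accept()
    while True:
        prompt = await websocket.receive_text()        # question from the provider
        async for chunk in generate_response(prompt):  # streamed tokens from Ollama
            await websocket.send_text(chunk)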
Prompt Templates
Augmenting the prompt with quality data and having a well-constructed prompt definitely improves the quality of the summary.
We gathered viewpoints from a few doctors and healthcare tech experts (they don't want me to put their names out or give them credit for the help :-( they are concerned about privacy) and added them as features in the application, and based on their continuous feedback we kept refactoring the prompts according to the generated summaries.
The full templates are given here: https://github.com/divyachandana/doctorapp/blob/main/prompt.py
def summarise_recent_health_info(json_data):
    return f"Provide a summary of the patient's recent health information first. Then, identify patterns, make predictions, and suggest care plans. Data:\n\n{json_data}"

def summarise_patient_info(json_data):
    return f"Provide a summary of the patient's general information. Data:\n\n{json_data}"

def summarise_patient_allergies(json_data):
    return f"Provide a summary of the patient's allergy information first. Then, identify patterns, make predictions, and suggest care plans. Data:\n\n{json_data}"
The backend setup is made very easy despite the complexity of serving multiple components. Follow the steps in the video and refer to the documentation.
React UI setup
Follow the README file for installing the application: https://github.com/divyachandana/patient_summary_ui
There are a lot of files in the UI; I will walk through an important concept: how the chat connects to WebSockets.
For chat, the backend WebSocket endpoint is ws://localhost:8000/ws.
Once the connection is established, the event handler logs each message to the output. This ensures real-time updates and a continuous flow of data, providing a seamless chat experience.
const connectWebSocket = () => {
  websocketRef.current = new WebSocket('ws://localhost:8000/ws');

  websocketRef.current.onopen = () => {
    console.log('WebSocket connection established');
    if (reconnectTimeoutRef.current) {
      clearTimeout(reconnectTimeoutRef.current);
      reconnectTimeoutRef.current = null;
    }
  };

  websocketRef.current.onmessage = (event) => {
    const message = event.data;
    console.log('Received message chunk:', message);
    setMessages(prevMessages => {
      const lastMessage = prevMessages.length > 0 ? prevMessages[prevMessages.length - 1] : null;
      if (lastMessage && lastMessage.type === 'received' && !lastMessage.complete) {
        // Update the last message's text by appending the new chunk
        const updatedText = lastMessage.text + message;
        return [...prevMessages.slice(0, -1), { ...lastMessage, text: updatedText }];
      }
      // Add the new message chunk as a new message
      return [...prevMessages, { text: message, type: 'received', timestamp: getTimeStamp(), complete: false }];
    });
  };
Metrics comparison
The WandB library has some built-in metrics that are logged as soon as we run the backend application, tracking local resource utilization, traffic, memory, etc.
I have customized some metrics and made use of the information returned by the Ollama API. Once a query finishes, the API responds with a lot of useful information about the generated content.
# Log final metrics if the response is complete
if 'done' in chunk_data and chunk_data['done']:
    eval_seconds = chunk_data['eval_duration'] / 1e9  # eval_duration is in nanoseconds
    tokens_per_second = chunk_data['eval_count'] / eval_seconds
    gpu_utilization, gpu_temp, vram_usage, fan_speed, power_cap, mem_usage = get_gpu_metrics()

    # Log final metrics to Weights & Biases
    final_log_data = {
        "tokens_per_second": tokens_per_second,
        "gpu_utilization": gpu_utilization,
        "gpu_temp": gpu_temp,
        "vram_usage": vram_usage,
        "fan_speed": fan_speed,
        "power_cap": power_cap,
        "mem_usage": mem_usage,
        "total_duration_seconds": chunk_data['total_duration'] / 1e9,
        "load_duration_seconds": chunk_data['load_duration'] / 1e9,
        "prompt_eval_count": chunk_data['prompt_eval_count'],
        "prompt_eval_duration_seconds": chunk_data['prompt_eval_duration'] / 1e9,
        "eval_count": chunk_data['eval_count'],
        "eval_duration_seconds": chunk_data['eval_duration'] / 1e9
    }
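Assuming a run was started earlier with wandb.init (the project name below is illustrative), pushing these values to the dashboard is a single call:
import wandb

wandb.init(project="doctorapp-metrics")  # hypothetical project name
# ... inside the request handler, once final_log_data is built:
wandb.log(final_log_data)                # one data point per completed request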
In order to get the GPU metrics with ease, I have made use of subprocess, which retrieves GPU details from rocm-smi and then parses the data.
The full code to gather GPU metrics from rocm-smi is provided here: https://github.com/divyachandana/doctorapp/blob/main/gpu_metrics.py
import re
import subprocess

result = subprocess.run(['rocm-smi'], capture_output=True, text=True)
output = result.stdout
# Clean the output to remove escape sequences and unnecessary lines
cleaned_output = re.sub(r'\x1B[@-_][0-?]*[ -/]*[@-~]', '', output)
lines = cleaned_output.splitlines()
for line in lines:
    ..........
# Convert the parsed string values into numbers
temp = float(temp.replace("°C", ""))
power = float(power.replace("W", ""))
fan = float(fan.replace("%", ""))
pwr_cap = float(pwr_cap.replace("W", "")) if pwr_cap != "Unsupported" else None
vram = int(vram.replace("%", ""))
gpu = int(gpu.replace("%", ""))
mem_usage = int(mem_usage.replace("%", ""))
return gpu, temp, vram, fan, pwr_cap, mem_usage
The metrics below compare the models on values such as latency, model load times, tokens per second, and GPU consumption. I have used Medtronic, medllama2, llama3, llama3.1, and llama3.1 70b.
Here are the detailed metrics for the various custom values that go from the backend to the WandB web page. I have compared 4 different LLMs on both their performance and their summary outputs.
When comparing the summary outputs of different models on the same patient category data, each model gives different variations.
This comparison includes human feedback on the summaries as well as numerical metrics.
Based on the metrics from llama3.1 8b and llama3.1 70b, it is necessary to evaluate which one to use. As the outputs below show, the 8b model captures the necessary information, providing good output in less time and loading quickly, whereas llama3.1 70b also captures the necessary information and its output is very detailed and precise, but its downside is that loading and response times are too slow.
In conclusion, the A.MD (A Doctor of Medicine) summary application effectively integrates synthetic EHR data to generate precise and detailed summaries. Leveraging an AMD GPU, the application shows exceptional performance in handling large patient datasets and running LLMs, providing high throughput and low latency. Furthermore, the ability to run LLMs locally ensures data privacy and compliance with healthcare regulations, negating the need for the cloud and the risk of data breaches.