This project aims to fine-tune an open-source large language model (LLM) on a custom dataset to develop expertise in a particular field. In this case, the objective is to create a medical LLM from a specialized dataset: an English corpus of real recorded dialogues between patients and doctors. After preprocessing, the dataset comprises 100 cases, with patients' ages ranging from 60 to 92. The dataset was acquired from the following link: Diagnoise me (kaggle.com). This medical LLM will be especially suitable for elderly users, as it was trained on samples from elderly patients.
Problem Identification and Benefits
Health-related issues are a major concern and expense among elderly populations. Seeing a human doctor requires an appointment, which can sometimes take months to schedule. In contrast, a medical LLM is accessible 24/7, at minimal cost, and without location restrictions. Additionally, while a human doctor's knowledge is limited by their individual experience, a medical LLM can be trained on an arbitrarily large number of cases, providing a comprehensive and continually updated knowledge base.
Preprocessing of the dialogue dataset
Data integrity: The dataset contained duplicate entries, which were removed programmatically to prevent model bias, and columns with missing values were dropped to keep the dataset stable. These steps ensure data consistency and integrity, thereby improving the model's prediction accuracy. A minimal sketch of this step is shown below.
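The preprocessing code is not included in the listing later in the article, so the following is a minimal pandas sketch of this step; the file name matches the one loaded in the main code, but the exact cleaning logic used in the project is an assumption.
import pandas as pd

# Load the raw dialogue data (same file as in the main listing)
df = pd.read_csv("medical_dialog.csv")
# Remove exact duplicate rows so repeated cases do not bias the model
df = df.drop_duplicates()
# Drop every column that contains missing values, as described above
df = df.dropna(axis=1)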
Data quality: The dataset contained irrelevant information (such as nonsensical sentences and URLs), which was filtered out using regular expressions and text-cleaning libraries. Spelling errors and chat abbreviations (e.g., "u" instead of "you") were also detected and corrected, using natural language processing techniques to take context into account so the corrections are appropriate. This consistency allows the model to learn more accurately. A sketch of the URL and abbreviation handling follows.
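As a sketch only (the context-aware, NLP-based correction mentioned above is not reproduced here), URLs can be stripped with a regular expression and simple chat abbreviations expanded from a small lookup table; the table below is hypothetical.
import re
import pandas as pd

# Hypothetical abbreviation table; the project likely used a larger one
ABBREVIATIONS = {r"\bu\b": "you", r"\bur\b": "your", r"\bpls\b": "please"}

def clean_text(text):
    text = re.sub(r"https?://\S+", "", text)  # remove URLs
    for pattern, replacement in ABBREVIATIONS.items():  # expand abbreviations
        text = re.sub(pattern, replacement, text, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

df = pd.read_csv("medical_dialog.csv")
for column in ("Doctor", "Patient"):
    df[column] = df[column].astype(str).apply(clean_text)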
Data accuracy: Because the model aims to provide medical advice to the elderly, only dialogues from a specific age group (60 to 90 years old) were kept. The data was filtered by age to select the target group, enabling the model to give more accurate and effective advice to those users. The accuracy of the age metadata was also verified, and inaccurate records were excluded. A sketch of the filter is shown below.
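A sketch of the age filter; the column name "Age" is an assumption, since only the "Doctor" and "Patient" columns appear in the code later in the article.
import pandas as pd

df = pd.read_csv("medical_dialog.csv")
# Coerce the (assumed) age column to numbers; invalid entries become NaN
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
# Keep only dialogues from patients in the target age range
df = df[df["Age"].between(60, 90)]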
Privacy and ethics: Dialogues containing terms that might violate compliance standards were excluded. Filtering mechanisms were implemented to protect data privacy and adhere to legal and ethical standards; specifically, a system was built to automatically identify and remove sensitive or personally identifiable information. Furthermore, user consent for data usage was obtained, and only ethically sound data was used, maintaining the dataset's reliability and legal compliance. A simplified illustration follows.
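A simplified keyword filter illustrating the idea; the automatic PII-detection system described above was more elaborate, and the blocked terms listed here are purely hypothetical.
import pandas as pd

# Purely hypothetical examples of terms that flag a dialogue for exclusion
BLOCKED_TERMS = ["social security number", "credit card", "home address"]

def is_compliant(text):
    lowered = str(text).lower()
    return not any(term in lowered for term in BLOCKED_TERMS)

df = pd.read_csv("medical_dialog.csv")
df = df[df["Doctor"].apply(is_compliant) & df["Patient"].apply(is_compliant)]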
Explanation of the code
!pip install gradientai
This command installs the gradientai library, which provides functionality for interacting with the Gradient AI platform.
import pandas as pd
from gradientai import Gradient
Pandas is a Python library for data manipulation and analysis; it is used here to handle and process data in tabular form. Gradient is a class from the gradientai library that lets you interact with Gradient AI's model training and deployment features.
csv_file_path = "medical_dialog.csv"
df = pd.read_csv(csv_file_path)
The 'csv_file_path' variable holds the path to the CSV file containing the data, which is then read into a DataFrame, the tabular data structure provided by pandas.
df_extracted = df[['Doctor', 'Patient']]
A new DataFrame containing only the 'Doctor' and 'Patient' columns from the original DataFrame is created.
formatted_data = []
for index, row in df_extracted.iterrows():
    entry = {
        "inputs": f"### Instruction: {row['Patient']}\n\n### Response: {row['Doctor']}"
    }
    formatted_data.append(entry)
'formatted_data = []' creates an empty list in which each item will be a dictionary. 'df_extracted.iterrows()' iterates over each row of the DataFrame. For each row, 'entry' is a dictionary whose "inputs" key holds a string pairing the instruction (the patient's query) with the expected response (the doctor's advice), and 'formatted_data.append(entry)' adds that dictionary to the list.
def chunk_data(data, chunk_size):
    for i in range(0, len(data), chunk_size):
        yield data[i:i + chunk_size]
'chunk_data' is a generator function that splits data into chunks of size 'chunk_size'. 'yield' returns a chunk of the data without storing the entire result in memory, which is efficient for handling large datasets.
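For example, splitting seven items into chunks of three yields:
list(chunk_data(list(range(7)), 3))
# -> [[0, 1, 2], [3, 4, 5], [6]]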
import os
os.environ['GRADIENT_WORKSPACE_ID'] = '9b5ba3df-2f43-4838-95f4-8aed2f358fe1_workspace'
os.environ['GRADIENT_ACCESS_TOKEN'] = 'FqCi9dPZgDMbMV7hIG59s529c99Uy4KV'
'os.environ' is used to set environment variables for 'GRADIENT_WORKSPACE_ID' and 'GRADIENT_ACCESS_TOKEN', which are needed to authenticate and interact with the Gradient AI platform.
def fine_tune_model(samples, access_token, workspace_id, base_model_slug="nous-hermes2", model_name="HHmodel", epochs=2, chunk_size=50):
    gradient = Gradient(access_token=access_token, workspace_id=workspace_id)
    base_model = gradient.get_base_model(base_model_slug=base_model_slug)
    new_model_adapter = base_model.create_model_adapter(name=model_name)
    print(f"Created model adapter with id {new_model_adapter.id}")

    sample_query = "### Instruction: I am always tired and I cough blood, why? \n\n ### Response:"
    print(f"Asking: {sample_query}")

    # Before fine-tuning
    completion = new_model_adapter.complete(query=sample_query, max_generated_token_count=100).generated_output
    print(f"Generated(before fine tuning): {completion}")

    for epoch in range(epochs):
        print(f"Fine tuning the model with iteration {epoch + 1}")
        for chunk in chunk_data(samples, chunk_size):
            new_model_adapter.fine_tune(samples=chunk)

    # After fine-tuning
    completion = new_model_adapter.complete(query=sample_query, max_generated_token_count=500).generated_output
    print(f"Generated(after fine tuning): {completion}")

    new_model_adapter.delete()
    gradient.close()
The function fine_tune_model takes several parameters: samples (the data to be used for fine-tuning), access_token (for authentication), workspace_id, base_model_slug (identifying the base model to be used), model_name (the name for the new fine-tuned model), epochs (the number of training cycles), and chunk_size (the size of the data chunks for each fine-tuning step).
To create and prepare the model, an instance of Gradient is created with the provided access_token and workspace_id. The base model, identified by base_model_slug, is fetched, and a new model adapter for fine-tuning is created from this base model, named according to model_name. A message confirming the creation of this model adapter is printed.
For the pre-fine-tuning test, the function asks the new model adapter to generate a response to a predefined query (sample_query) before any fine-tuning occurs; the generated response is printed to serve as a baseline for comparison. During the fine-tuning process, the function loops over the specified number of epochs; in each epoch it iterates over the samples in chunks of size chunk_size and fine-tunes the adapter on each chunk. After the fine-tuning completes, the function generates a response to the same query (sample_query) using the fine-tuned model, and this new response is printed to observe the effect of fine-tuning.
Lastly, the model adapter is deleted to clean up resources, and the connection to Gradient is closed.
if __name__ == "__main__":
    access_token = os.environ.get('GRADIENT_ACCESS_TOKEN')
    workspace_id = os.environ.get('GRADIENT_WORKSPACE_ID')
    fine_tune_model(samples=formatted_data, access_token=access_token, workspace_id=workspace_id)
Entry point of the program: the credentials are read back from the environment variables set earlier, and fine_tune_model is run on the formatted dialogue data.
Future improvements
Expanding the Dataset: Increasing the size and diversity of the dataset with more patient-doctor dialogues, covering a wider range of medical conditions and scenarios, would enhance the model's generalization and accuracy.
Multilingual Support: Adding support for multiple languages would make the LLM accessible to non-English speaking elderly populations, broadening its reach and utility.
Integration with Wearable Devices: Integrating the LLM with wearable health devices can provide real-time health monitoring and personalized advice based on current health data.