Reading scientific papers and studies is difficult for most people. They are dense, and their findings aren't always communicated clearly. Can you reasonably tell whether a study is showing correlation or causation? (An article from about a decade ago delved into this problem and is a fun read - https://fivethirtyeight.com/features/science-isnt-broken/.) Does the study you're analyzing have a large enough sample size? How representative is it of you personally, and how does its population compare to the populations of the other studies? If you're writing a paper analyzing these studies, should this one be included? Do you understand what the paper is actually communicating?
Those last two questions are the focus of this project. The goal is to build a PC-based solution that you can point at PDFs you provide, so that you can ask questions from your own machine to help analyze that data. I want it to be PC based instead of cloud based.
Large language models like ChatGPT or Google's Gemini are becoming much more popular for answering questions. The main issue with these models is that they are inherently cloud-based products: any data you upload is stored on someone else's servers, and you don't necessarily control what happens to it. When the LLM runs on your own computer, you keep complete control of your data, and with a capable GPU like the AMD Radeon Pro W7900, a local model can respond quickly with no dependence on an internet connection or a cloud provider.
Requirements
- Use Jupyter Notebook - easy-to-read code that is easy to change
- Use a PC based large language model
- Inputs are going to be PDFs
- Result is a chat that you can have to help understand that data
- Rely on the internet as little as possible - if your connection goes down, you don't want to lose your work.
If you haven't already, install the drivers for your graphics card - https://www.amd.com/en/support/download/drivers.html
If you have the AMD Radeon Pro W7900 GPU, use this link - https://www.amd.com/en/support/downloads/drivers.html/graphics/radeon-pro/radeon-pro-w7000-series/amd-radeon-pro-w7900.html (I'm using PRO Edition, 24.Q1.1)
Follow AMD's instructions to install the HIP SDK on Windows - https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html#hip-install-quick
Installing a Local Large Language Model (LLM)
I'm using a Windows PC (a Dell XPS 8950) running Windows 11 Pro with an AMD Radeon Pro W7900 GPU installed.
To make this as easy as possible, we're going to use Ollama to install the required LLM. Ollama has supported AMD graphics cards since March 14, 2024. You can check whether your GPU is supported here - https://ollama.com/blog/amd-preview - but as of June 2024, this was the list of supported cards:
You can install Ollama on your PC from https://ollama.com/ and then try it out with:
ollama run llama3 --verbose
This will download Llama 3 onto your PC (it is approximately 18GB, so be aware). You can use Llama 2 instead, but it will not be as capable a chatbot.
Ask it to tell you a story and you should be able to see the GPU being utilized during the operation if you have the AMD Software: Pro Edition open.
If you don't have it already, install Anaconda from their website - https://www.anaconda.com/download/.
Open Anaconda Prompt and create a new environment:
conda create -n tesseract_env
This will create an environment and it will be located in:
C:\ProgramData\anaconda3\envs\tesseract_env
Activate the environment:
conda activate tesseract_env
Install pillow:
conda install pillow
Install Jupyter Notebook:
conda install notebook
Install langchain and langchain-community:
conda install langchain
conda install langchain-community
Install PyPDF2:
conda install PyPDF2
Install PyMuPDF:
pip install PyMuPDF
Install OpenCV:
conda install opencv
Install ipywidgets:
conda install ipywidgets
Install LlamaIndex (but without the OpenAI components):
pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
Install Tesseract (the program): https://github.com/UB-Mannheim/tesseract/wiki
When you install it, set the install location to:
C:\ProgramData\anaconda3\envs\tesseract_env\Library\bin
(so the executable sits inside the conda environment created above)
Install tesseract:
conda install tesseract
Install pytesseract:
conda install pytesseract
You can verify the tesseract installation by typing into your open Anaconda Prompt:
tesseract --version
If it works, you should see the version information:
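You can also check from Python (for example, in a notebook cell once Jupyter is running) that pytesseract can find the Tesseract binary. This is a minimal sketch; the path below assumes the install location suggested above, so adjust it to match yours.

import pytesseract

# If Tesseract isn't on your PATH, point pytesseract at the executable directly.
# This path assumes the install location suggested above; adjust it to match your setup.
pytesseract.pytesseract.tesseract_cmd = (
    r"C:\ProgramData\anaconda3\envs\tesseract_env\Library\bin\tesseract.exe"
)

# Should print the Tesseract version, e.g. 5.x
print(pytesseract.get_tesseract_version())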
Navigate to where you would like to run your scripts (I put them in my documents folder):
cd C:\Users\USER_NAME\Documents\python\Stuff
And start a jupyter lab session:
jupyter lab
It will open a session in your default web browser for you to use:
From here you can create a new notebook and start programming.
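As a quick first cell, you can check that the local Ollama server is reachable from Python before going further. This is a minimal sketch, assuming Ollama is running on its default port (11434) and that llama3 has already been pulled; you may need to run conda install requests in the environment if it isn't already there.

import requests

# Ollama serves a local HTTP API on port 11434 by default.
# This assumes `ollama run llama3` (or `ollama pull llama3`) has already been done.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Reply with the single word: ready",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])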
Extracting Text from PDFs
Open the Jupyter notebook and verify that PyPDF2 works (SHIFT+ENTER runs the cell).
import PyPDF2
If that results in an error, then troubleshoot the error and proceed.
From here, save this Jupyter notebook, and in the same folder where you place it, create two folders - one called PDFs and one called TextFiles.
Drag the PDFs you want to process inside the PDFs folder and run the following code:
import os
import PyPDF2

def extract_text_from_pdfs(pdf_folder, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Iterate over all PDF files in the specified folder
    for filename in os.listdir(pdf_folder):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            pdf_reader = PyPDF2.PdfReader(pdf_path)

            # Create a filename for the output text file
            output_filename = f"{os.path.splitext(filename)[0]}.txt"
            output_path = os.path.join(output_folder, output_filename)

            # Open the output text file in write mode
            with open(output_path, "w", encoding="utf-8") as text_file:
                # Extract text from each page and write it to the output file
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    text = page.extract_text()

                    # Write a header for each page
                    text_file.write(f"\n\n--- {filename}, Page {page_num + 1} ---\n\n")
                    text_file.write(text)
                    print(f"Extracted text from {filename}, page {page_num + 1}")

if __name__ == "__main__":
    pdf_folder = "PDFs"  # Folder containing the PDF files
    output_folder = "TextFiles"  # Folder where the text files will be saved
    extract_text_from_pdfs(pdf_folder, output_folder)
This is also found in the Jupyter notebook file attached to this project. It opens every PDF in the folder, goes through each page, and extracts whatever text it can. It then concatenates those pages into a single text file, with a header calling out the page number to mark the breaks in the PDF (this is more for the user's reference than for the LLM, because LLMs get confused about how many pages a document has or where they are).
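Note that extract_text() only works on PDFs that have an embedded text layer; scanned pages usually come back empty. A small check like the one below (the helper name and the character threshold are just choices for this sketch) can flag which files probably need the OCR route described next.

import os
import PyPDF2

def find_pdfs_needing_ocr(pdf_folder, min_chars=20):
    """Return PDFs whose pages yield little or no extractable text.

    min_chars is an arbitrary per-page threshold for this sketch; tune it to your documents.
    """
    needs_ocr = []
    for filename in os.listdir(pdf_folder):
        if not filename.lower().endswith(".pdf"):
            continue
        reader = PyPDF2.PdfReader(os.path.join(pdf_folder, filename))
        total_chars = sum(len(page.extract_text() or "") for page in reader.pages)
        if total_chars < min_chars * len(reader.pages):
            needs_ocr.append(filename)
    return needs_ocr

print(find_pdfs_needing_ocr("PDFs"))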
If you have handwritten notes that you want to include, we can use Tesseract and pytesseract to convert them to text. From a PNG image:
import pytesseract
from PIL import Image
# Path to the image file
image_path = 'test1.png'
# Open the image file
image = Image.open(image_path)
# Use pytesseract to do OCR on the image
text = pytesseract.image_to_string(image)
# Print the extracted text
print(text)
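Handwriting and low-contrast scans often OCR better after a little cleanup, which is where the OpenCV install comes in. This is an optional sketch, not part of the main pipeline, reusing the same test1.png image; the thresholding step is one common choice, not the only one.

import cv2
import pytesseract

# Load the scan, convert to grayscale, and apply Otsu thresholding to
# separate ink from paper before handing it to Tesseract.
image = cv2.imread("test1.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# pytesseract accepts NumPy arrays directly.
text = pytesseract.image_to_string(binary)
print(text)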
And if you have a pdf of a bunch of handwritten notes, you'll want to convert every page to text:
import io
import fitz # PyMuPDF
import pytesseract
from PIL import Image, ImageDraw, ImageFont
import cv2
import numpy as np
import os
def pdf_to_text_and_images(pdf_path, output_folder):
    doc = fitz.open(pdf_path)
    all_text = ""

    for page_number in range(len(doc)):
        page = doc.load_page(page_number)
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # Perform OCR
        text = pytesseract.image_to_string(img)
        all_text += f"Page {page_number + 1}:\n{text}\n\n"

        # Get bounding boxes for the detected text
        boxes = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

        # Copy the page onto a fresh canvas of the same size so we can draw on it
        new_img = Image.new('RGB', (img.width, img.height), color='white')
        new_img.paste(img, (0, 0))
        draw = ImageDraw.Draw(new_img)

        # Draw green boxes around detected text areas
        n_boxes = len(boxes['level'])
        for i in range(n_boxes):
            if boxes['text'][i].strip() and boxes['level'][i] == 5:  # level 5 corresponds to individual words
                (x, y, w, h) = (boxes['left'][i], boxes['top'][i], boxes['width'][i], boxes['height'][i])
                draw.rectangle([x, y, x + w, y + h], outline='green')

        # Save the processed image
        output_image_path = os.path.join(output_folder, f'output_image_page_{page_number + 1}.jpg')
        new_img.save(output_image_path)
        print(f"Processed page {page_number + 1}")

    return all_text

# Usage
pdf_path = 'YOUR/PDF/FILE/NAME.pdf'
output_folder = 'YOUR/OUTPUT/FOLDER'

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process all pages of the PDF
extracted_text = pdf_to_text_and_images(pdf_path, output_folder)

# Save the entire extracted text to a single file
output_text_path = os.path.join(output_folder, 'full_extracted_text.txt')
with open(output_text_path, 'w', encoding='utf-8') as f:
    f.write(extracted_text)
print(f"Full extracted text saved to: {output_text_path}")

# Display the first processed image in the notebook
from IPython.display import Image as IPImage  # aliased so it doesn't shadow PIL's Image
IPImage(filename=os.path.join(output_folder, 'output_image_page_1.jpg'))
This takes the PDF you specify and outputs a marked-up image of each page, as well as a single text file with the combined text.
Running Llama from Jupyter Notebook
LangChain is going to be used to connect to Llama 3 from the Jupyter notebook. Verify that it's installed correctly:
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
And ask it for a story:
llm = Ollama(model="llama3", callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))
result = llm.invoke("Tell me a story in a thousand words?")
LangChain can be used to interact with Ollama and ask it to do whatever you need, and because Ollama runs on the GPU, responses come back quickly.
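For example, once llm is defined as above, you can wrap a reusable prompt around one of the text files extracted earlier. This is a small sketch; the file name TextFiles/example.txt and the placeholder paper_text are hypothetical and just illustrate the pattern.

from langchain.prompts import PromptTemplate

# A reusable prompt; "paper_text" is just a placeholder name for this sketch.
prompt = PromptTemplate.from_template(
    "Summarize the following study in three bullet points:\n\n{paper_text}"
)

# "TextFiles/example.txt" is a hypothetical file produced by the extraction step above.
with open("TextFiles/example.txt", encoding="utf-8") as f:
    paper_text = f.read()

# The streaming callback on llm prints the answer as it is generated.
summary = llm.invoke(prompt.format(paper_text=paper_text))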
We want the model to work from our own data; rather than actually retraining it, we're going to use llama_index to build an index over our documents that the model can retrieve from.
import warnings
# Suppress specific FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
# Load documents from the 'data' directory
documents = SimpleDirectoryReader("PDFs").load_data()
# Set the embedding model and LLM with specific settings
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=360.0)
# Create the index from the loaded documents
index = VectorStoreIndex.from_documents(
    documents,
)
This will take the files in the PDFs folder (change the folder name if needed), embed them with the BAAI/bge-base-en-v1.5 model, and build a searchable index; the llama3 model is used later to answer queries against that index.
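Building the index can take a while for a large set of PDFs, so it's worth persisting it to disk. Here's a minimal sketch using llama_index's storage context; the folder name "index_storage" is just a choice for this example.

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index so it doesn't have to be rebuilt every session
# ("index_storage" is just a folder name chosen for this sketch).
index.storage_context.persist(persist_dir="index_storage")

# In a later session (with Settings.embed_model and Settings.llm set as above),
# reload the index without re-reading the PDFs.
storage_context = StorageContext.from_defaults(persist_dir="index_storage")
index = load_index_from_storage(storage_context)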
From here, we can set up a query engine and ask it questions. This is not a conversational bot, so it won't remember what you last said, but it is useful for getting summaries of what was given to it (a conversational variant is sketched after the loop below).
query_engine = index.as_query_engine(streaming=True)
while True:
    question = input("Ask a question (or type 'exit' to quit): ")
    if question.lower() == 'exit':
        break
    print("\n")
    streaming_response = query_engine.query(question)
    streaming_response.print_response_stream()
    print("\n")
Testing it all out
So I'm going to use the National Healthcare Quality and Disparities Reports as an example - https://www.ahrq.gov/research/findings/nhqrdr/index.html
Being able to use your AMD GPU to run an LLM over your own data makes for a secure way to ask questions about that data, and it could be expanded into a chatbot that writes in your style while keeping everything local.