Reading scientific papers and studies is difficult for most people. They are dense, and their findings aren't always communicated clearly. Can you reasonably tell whether a study is showing correlation or causation? (An article from about a decade ago delved into this problem and is a fun read - https://fivethirtyeight.com/features/science-isnt-broken/.) Does the study you're analyzing have a large enough sample size? How representative is it of you personally, and how does its population compare to the populations of the other studies? If you're writing a paper analyzing these studies, should this one be included? Do you understand what the paper is actually communicating?
Those last two questions are the focus of this project. The goal is to build a PC-based solution that you can point at PDFs you provide, so that you can ask questions from your own machine to help analyze that data. I want it to be PC based instead of cloud based.
Large language models like ChatGPT or Google's Gemini are becoming much more popular for answering questions. The main issue with these models is that they are inherently cloud-based products: any data you upload is stored on someone else's servers, and you don't necessarily control what happens to it. When the LLM runs on your own computer, you keep complete control of your data, and with a capable GPU like the AMD Radeon Pro W7900, a local model can respond quickly with no dependence on an internet connection or a cloud provider.
Requirements
- Use Jupyter Notebook - easy-to-read code that is easy to change
- Use a PC based large language model
- Inputs are going to be PDFs
- Result is a chat that you can have to help understand that data
- Rely on the internet as little as possible - if your connection goes down, you don't want to lose your work.
If you haven't already, install the drivers for your graphics card - https://www.amd.com/en/support/download/drivers.html
If you have the AMD Radeon Pro W7900 GPU, use this link - https://www.amd.com/en/support/downloads/drivers.html/graphics/radeon-pro/radeon-pro-w7000-series/amd-radeon-pro-w7900.html (I'm using PRO Edition, 24.Q1.1)
Follow AMD's instructions to install the HIP SDK on Windows - https://rocm.docs.amd.com/projects/install-on-windows/en/latest/index.html#hip-install-quick
Installing a Local Large Language Model (LLM)
I'm using a Windows PC (a Dell XPS 8950) running Windows 11 Pro with an AMD Radeon Pro W7900 GPU installed.
To make this as easy as possible, we're going to use Ollama to install the required LLM. Ollama has supported AMD graphics cards since March 14, 2024. You can check whether your GPU is supported here - https://ollama.com/blog/amd-preview - but as of June 2024, this was the list of supported cards:
You can install Ollama on your PC from https://ollama.com/ and then try it out with:
ollama run llama3 --verbose
This will download Llama 3 onto your PC (it is approximately 18GB, so be aware). You can use Llama 2 instead, but it will not be as capable a chatbot.
Ask it to tell you a story and you should be able to see the GPU being utilized during the operation if you have the AMD Software: Pro Edition open.
If you don't have it already, install Anaconda from their website - https://www.anaconda.com/download/.
Open Anaconda Prompt and create a new environment:
conda create -n tesseract_env
This will create an environment and it will be located in:
C:\ProgramData\anaconda3\envs\tesseract_env
Activate the environment:
conda activate tesseract_env
Install pillow:
conda install pillow
Install Jupyter Notebook:
conda install notebook
Install langchain and langchain-community:
conda install langchain
conda install langchain-community
Install PyPDF2:
conda install PyPDF2
Install PyMuPDF:
pip install PyMuPDF
Install OpenCV:
conda install opencv
Install ipywidgets:
conda install ipywidgets
Install LlamaIndex (but without the OpenAI components):
pip install llama-index-core llama-index-readers-file llama-index-llms-ollama llama-index-embeddings-huggingface
Install Tesseract (the program): https://github.com/UB-Mannheim/tesseract/wiki
When you install it, set the install location to:
C:\ProgramData\anaconda3\envs\tesseract_env\Library\bin
(so the executable sits inside the conda environment created above)
Install tesseract:
conda install tesseract
Install pytesseract:
conda install pytesseract
You can verify the tesseract installation by typing into your open Anaconda Prompt:
tesseract --version
If it works, you should see the version information:
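You can also check from Python (for example, in a notebook cell once Jupyter is running) that pytesseract can find the Tesseract binary. This is a minimal sketch; the path below assumes the install location suggested above, so adjust it to match yours.

import pytesseract

# If Tesseract isn't on your PATH, point pytesseract at the executable directly.
# This path assumes the install location suggested above; adjust it to match your setup.
pytesseract.pytesseract.tesseract_cmd = (
    r"C:\ProgramData\anaconda3\envs\tesseract_env\Library\bin\tesseract.exe"
)

# Should print the Tesseract version, e.g. 5.x
print(pytesseract.get_tesseract_version())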
Navigate to where you would like to run your scripts (I put them in my documents folder):
cd C:\Users\USER_NAME\Documents\python\Stuff
And start a jupyter lab session:
jupyter lab
It will open a session in your default web browser for you to use:
From here you can create a new notebook and start programming.
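As a quick first cell, you can check that the local Ollama server is reachable from Python before going further. This is a minimal sketch, assuming Ollama is running on its default port (11434) and that llama3 has already been pulled; you may need to run conda install requests in the environment if it isn't already there.

import requests

# Ollama serves a local HTTP API on port 11434 by default.
# This assumes `ollama run llama3` (or `ollama pull llama3`) has already been done.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3",
        "prompt": "Reply with the single word: ready",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
response.raise_for_status()
print(response.json()["response"])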
Extracting Text from PDFs
Open the Jupyter notebook and verify that PyPDF2 works (SHIFT+ENTER runs the cell).
import PyPDF2
If that results in an error, then troubleshoot the error and proceed.
From here, save this Jupyter notebook, and in the same folder where you place it, create two folders - one called PDFs and one called TextFiles.
Drag the PDFs you want to process inside the PDFs folder and run the following code:
import os
import PyPDF2

def extract_text_from_pdfs(pdf_folder, output_folder):
    # Ensure the output folder exists
    os.makedirs(output_folder, exist_ok=True)

    # Iterate over all PDF files in the specified folder
    for filename in os.listdir(pdf_folder):
        if filename.endswith(".pdf"):
            pdf_path = os.path.join(pdf_folder, filename)
            pdf_reader = PyPDF2.PdfReader(pdf_path)

            # Create a filename for the output text file
            output_filename = f"{os.path.splitext(filename)[0]}.txt"
            output_path = os.path.join(output_folder, output_filename)

            # Open the output text file in write mode
            with open(output_path, "w", encoding="utf-8") as text_file:
                # Extract text from each page and write it to the output file
                for page_num in range(len(pdf_reader.pages)):
                    page = pdf_reader.pages[page_num]
                    text = page.extract_text()

                    # Write a header for each page
                    text_file.write(f"\n\n--- {filename}, Page {page_num + 1} ---\n\n")
                    text_file.write(text)
                    print(f"Extracted text from {filename}, page {page_num + 1}")

if __name__ == "__main__":
    pdf_folder = "PDFs"  # Folder containing the PDF files
    output_folder = "TextFiles"  # Folder where the text files will be saved
    extract_text_from_pdfs(pdf_folder, output_folder)
This is also found in the Jupyter notebook file attached to this project. It opens every PDF in the folder, goes through each page, and extracts whatever text it can. It then concatenates those pages into a single text file, with a header calling out the page number to mark the breaks in the PDF (this is more for the user's reference than for the LLM, because LLMs get confused about how many pages a document has or where they are).
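Note that extract_text() only works on PDFs that have an embedded text layer; scanned pages usually come back empty. A small check like the one below (the helper name and the character threshold are just choices for this sketch) can flag which files probably need the OCR route described next.

import os
import PyPDF2

def find_pdfs_needing_ocr(pdf_folder, min_chars=20):
    """Return PDFs whose pages yield little or no extractable text.

    min_chars is an arbitrary per-page threshold for this sketch; tune it to your documents.
    """
    needs_ocr = []
    for filename in os.listdir(pdf_folder):
        if not filename.lower().endswith(".pdf"):
            continue
        reader = PyPDF2.PdfReader(os.path.join(pdf_folder, filename))
        total_chars = sum(len(page.extract_text() or "") for page in reader.pages)
        if total_chars < min_chars * len(reader.pages):
            needs_ocr.append(filename)
    return needs_ocr

print(find_pdfs_needing_ocr("PDFs"))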
If you have handwritten notes that you want to include, we can use Tesseract and pytesseract to convert them to text. From a PNG image:
import pytesseract
from PIL import Image
# Path to the image file
image_path = 'test1.png'
# Open the image file
image = Image.open(image_path)
# Use pytesseract to do OCR on the image
text = pytesseract.image_to_string(image)
# Print the extracted text
print(text)
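Handwriting and low-contrast scans often OCR better after a little cleanup, which is where the OpenCV install comes in. This is an optional sketch, not part of the main pipeline, reusing the same test1.png image; the thresholding step is one common choice, not the only one.

import cv2
import pytesseract

# Load the scan, convert to grayscale, and apply Otsu thresholding to
# separate ink from paper before handing it to Tesseract.
image = cv2.imread("test1.png")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
_, binary = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

# pytesseract accepts NumPy arrays directly.
text = pytesseract.image_to_string(binary)
print(text)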
And if you have a pdf of a bunch of handwritten notes, you'll want to convert every page to text:
import io
import fitz # PyMuPDF
import pytesseract
from PIL import Image, ImageDraw, ImageFont
import cv2
import numpy as np
import os
def pdf_to_text_and_images(pdf_path, output_folder):
    doc = fitz.open(pdf_path)
    all_text = ""

    for page_number in range(len(doc)):
        page = doc.load_page(page_number)
        pix = page.get_pixmap()
        img = Image.frombytes("RGB", [pix.width, pix.height], pix.samples)

        # Perform OCR
        text = pytesseract.image_to_string(img)
        all_text += f"Page {page_number + 1}:\n{text}\n\n"

        # Get bounding boxes for the detected text
        boxes = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)

        # Copy the page onto a fresh canvas of the same size so we can draw on it
        new_img = Image.new('RGB', (img.width, img.height), color='white')
        new_img.paste(img, (0, 0))
        draw = ImageDraw.Draw(new_img)

        # Draw green boxes around detected text areas
        n_boxes = len(boxes['level'])
        for i in range(n_boxes):
            if boxes['text'][i].strip() and boxes['level'][i] == 5:  # level 5 corresponds to individual words
                (x, y, w, h) = (boxes['left'][i], boxes['top'][i], boxes['width'][i], boxes['height'][i])
                draw.rectangle([x, y, x + w, y + h], outline='green')

        # Save the processed image
        output_image_path = os.path.join(output_folder, f'output_image_page_{page_number + 1}.jpg')
        new_img.save(output_image_path)
        print(f"Processed page {page_number + 1}")

    return all_text

# Usage
pdf_path = 'YOUR/PDF/FILE/NAME.pdf'
output_folder = 'YOUR/OUTPUT/FOLDER'

# Create output folder if it doesn't exist
os.makedirs(output_folder, exist_ok=True)

# Process all pages of the PDF
extracted_text = pdf_to_text_and_images(pdf_path, output_folder)

# Save the entire extracted text to a single file
output_text_path = os.path.join(output_folder, 'full_extracted_text.txt')
with open(output_text_path, 'w', encoding='utf-8') as f:
    f.write(extracted_text)
print(f"Full extracted text saved to: {output_text_path}")

# Display the first processed image in the notebook
from IPython.display import Image as IPImage  # aliased so it doesn't shadow PIL's Image
IPImage(filename=os.path.join(output_folder, 'output_image_page_1.jpg'))
This takes the PDF you specify and outputs a marked-up image of each page, as well as a single text file with the combined text.
Running Llama from Jupyter Notebook
LangChain is going to be used to connect to Llama 3 from the Jupyter notebook. Verify that it's installed correctly:
from langchain.llms import Ollama
from langchain.callbacks.manager import CallbackManager
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
And ask it for a story:
llm = Ollama(model="llama3", callback_manager = CallbackManager([StreamingStdOutCallbackHandler()]))
result = llm.invoke("Tell me a story in a thousand words?")
LangChain can be used to interact with Ollama and ask it to do whatever you need, and because Ollama runs on the GPU, responses come back quickly.
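For example, once llm is defined as above, you can wrap a reusable prompt around one of the text files extracted earlier. This is a small sketch; the file name TextFiles/example.txt and the placeholder paper_text are hypothetical and just illustrate the pattern.

from langchain.prompts import PromptTemplate

# A reusable prompt; "paper_text" is just a placeholder name for this sketch.
prompt = PromptTemplate.from_template(
    "Summarize the following study in three bullet points:\n\n{paper_text}"
)

# "TextFiles/example.txt" is a hypothetical file produced by the extraction step above.
with open("TextFiles/example.txt", encoding="utf-8") as f:
    paper_text = f.read()

# The streaming callback on llm prints the answer as it is generated.
summary = llm.invoke(prompt.format(paper_text=paper_text))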
We want the model to work from our own data; rather than actually retraining it, we're going to use llama_index to build an index over our documents that the model can retrieve from.
import warnings
# Suppress specific FutureWarning
warnings.filterwarnings("ignore", category=FutureWarning, module="huggingface_hub")
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader, Settings
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.llms.ollama import Ollama
# Load documents from the 'data' directory
documents = SimpleDirectoryReader("PDFs").load_data()
# Set the embedding model and LLM with specific settings
Settings.embed_model = HuggingFaceEmbedding(model_name="BAAI/bge-base-en-v1.5")
Settings.llm = Ollama(model="llama3", request_timeout=360.0)
# Create the index from the loaded documents
index = VectorStoreIndex.from_documents(
    documents,
)
This will take the files in the PDFs folder (change the folder name if needed), embed them with the BAAI/bge-base-en-v1.5 model, and build a searchable index; the llama3 model is used later to answer queries against that index.
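Building the index can take a while for a large set of PDFs, so it's worth persisting it to disk. Here's a minimal sketch using llama_index's storage context; the folder name "index_storage" is just a choice for this example.

from llama_index.core import StorageContext, load_index_from_storage

# Persist the index so it doesn't have to be rebuilt every session
# ("index_storage" is just a folder name chosen for this sketch).
index.storage_context.persist(persist_dir="index_storage")

# In a later session (with Settings.embed_model and Settings.llm set as above),
# reload the index without re-reading the PDFs.
storage_context = StorageContext.from_defaults(persist_dir="index_storage")
index = load_index_from_storage(storage_context)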
From here, we can set up a query engine and ask it questions. This is not a conversational bot, so it won't remember what you last said, but it is useful for getting summaries of what was given to it (a conversational variant is sketched after the loop below).
query_engine = index.as_query_engine(streaming=True)
while True:
    question = input("Ask a question (or type 'exit' to quit): ")
    if question.lower() == 'exit':
        break
    print("\n")
    streaming_response = query_engine.query(question)
    streaming_response.print_response_stream()
    print("\n")
Testing it all out
So I'm going to use the National Healthcare Quality and Disparities Reports as an example - https://www.ahrq.gov/research/findings/nhqrdr/index.html
Being able to use your AMD GPU to run an LLM over your own data makes for a secure way to ask questions about that data, and it could be expanded into a chatbot that writes in your style while keeping everything local.