Fine-tuning is one of the most effective ways to adapt a pre-trained model to a specific task: you continue training the model on a custom dataset so that it performs better on that task.
Today, I'll be fine-tuning the Llama3-8B model on the NVIDIA Jetson AGX Orin Developer Kit using a dataset generated specifically for Harry Potter fans. This dataset will be created using the GPT-4o model.
Preparing the data with the GPT-4o model
We'll leverage a subset of the Simple Questions V2 dataset available on Hugging Face to feed into GPT-4o.
We'll first run a Redis server using Docker. This will enable us to queue and process tasks in the background.
docker run --name local-redis --network=host -d redis redis-server
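To confirm the server is up, you can ping it through the container (an optional check):
docker exec -it local-redis redis-cli ping
# Expected reply: PONG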
RQ or Redis Queue is a simple Python library for queueing jobs and processing them in the background with workers.
Next, we'll install the required Python libraries:
pip install redis
pip install rq
pip install openai
We'll define a prompt template for GPT-4o:
HARRY_POTTER_PROMPT_TEMPLATE = """
You are a die-hard fan of the Harry Potter series. Every question you are asked, you respond with a short, magical reference to the series.
Do not cite the book or chapter.
Answer briefly and enchantingly.
One or two sentences is good enough:
{question}
"""
The template ensures responses are concise and avoid directly citing the book or chapter.
Run the script generate.py to create your question-answer samples.
python3 generate.py
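For reference, here is a minimal sketch of how generate.py and its RQ job function could be structured. The file names (tasks.py, questions.txt), the per-sample files written to ./dataset/, and the generate_sample helper are illustrative assumptions rather than the exact script; the prompt is the HARRY_POTTER_PROMPT_TEMPLATE defined above, which you would paste into tasks.py.
# tasks.py -- job executed by the RQ workers (sketch)
import json
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# paste HARRY_POTTER_PROMPT_TEMPLATE from above here

def generate_sample(question, index):
    """Ask GPT-4o for a Harry Potter style answer and write one JSONL sample."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": HARRY_POTTER_PROMPT_TEMPLATE.format(question=question)}],
    )
    answer = response.choices[0].message.content
    os.makedirs("dataset", exist_ok=True)
    with open(f"dataset/sample_{index}.jsonl", "w") as f:
        f.write(json.dumps({"question": question, "answer": answer}) + "\n")

# generate.py -- enqueue one job per question
from redis import Redis
from rq import Queue
from tasks import generate_sample

queue = Queue(connection=Redis())   # connects to the local-redis container
with open("questions.txt") as f:    # questions exported from the Simple Questions V2 subset
    for i, question in enumerate(f):
        queue.enqueue(generate_sample, question.strip(), i)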
Then, launch a pool of worker processes that will handle the generated jobs.
rq worker-pool -n 10
I spent $10 and almost 30 minutes generating 13,590 question-answer samples. Here is an example of a sample:
{"question": "what are names of towns in japan\n", "answer": "\"Ah, the beauty of magic in Japan, like a Patronus casting light... Tokyo, Kyoto, Osaka, each a spellbinding location in the Muggle world!\" \ud83e\ude84\u2728"}
Combine all the generated samples from the dataset directory into a single file named data.jsonl using the following command:
cat ./dataset/* > data.jsonl
Next, we split the dataset into training and validation datasets:
head -n 10590 data.jsonl > train.jsonl
tail -n 3000 data.jsonl > val.jsonl
This extracts the first 10,590 lines for the training set and the last 3,000 lines for the validation set.
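A quick sanity check that the split adds up:
wc -l data.jsonl train.jsonl val.jsonl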
To ensure the data is formatted correctly, we can use a code snippet like this (ideally in a Jupyter notebook) to view a sample entry:
chat_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""
from datasets import load_dataset

def formatting_prompts_func(examples):
    # Build a plain "Instruction / Input / Output" string for every sample
    instruction = ""
    inputs = examples["question"]
    outputs = examples["answer"]
    texts = []
    for input, output in zip(inputs, outputs):
        text = f"Instruction: {instruction}\nInput: {input}\nOutput: {output}"
        texts.append(text)
    return {"text": texts}

dataset = load_dataset('json', data_files='train.jsonl', split='train')
dataset = dataset.map(formatting_prompts_func, batched=True)
Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case.
Show the first processed entry:
print(dataset['text'][0])
The output looks like this:
Instruction:
Input: Name the genre of the film odd thomas
Output: It's a muggle tale with shadows of the dark arts and spectral whispers! 🌑👻✨
By following these steps, we'll have a prepared Harry Potter Fanatic dataset ready to be used for fine-tuning the Llama3-8B model on the NVIDIA Jetson AGX Orin Developer Kit.
Fine-Tuning a Llama3-8B Model on the NVIDIA Jetson AGX Orin
In the following, we will load the Llama 3 8B model in 4-bit precision thanks to bitsandbytes and Dustin Franklin's jetson-containers project. We then set the LoRA configuration using PEFT for QLoRA.
jetson-containers run -v /path/on/host:/path/in/container $(autotag bitsandbytes)
Install the necessary Python packages and libraries required for fine-tuning:
pip install peft
pip install trl
pip install wandb
Note that you need to submit a request for access to meta-llama/Meta-Llama-3-8B and be logged in to your Hugging Face account. To download the model once access has been granted, make sure you are logged in to the Hugging Face Hub with a valid Hugging Face token.
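If you are not logged in yet, one way to authenticate is via the command line (the token comes from your Hugging Face account settings):
huggingface-cli login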
Before starting, set the power mode of the NVIDIA Jetson AGX Orin to MAXN so the fine-tuning run gets maximum performance.
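On Jetson devices this is typically done with nvpmodel and jetson_clocks; the mode index below (0 for MAXN on the AGX Orin) is an assumption that can vary between JetPack releases, so verify it with nvpmodel -q first.
sudo nvpmodel -m 0   # switch to MAXN (confirm the index for your JetPack release)
sudo jetson_clocks   # lock the clocks at their maximum
The code below then defines the full training setup for fine-tuning the Llama3-8B model: it loads the tokenizer and datasets, configures 4-bit quantization, and sets the training arguments.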
# 1. Importing and configurations
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer
import wandb
import gc
from huggingface_hub import login
login(token="YOUR_HUGGING_FACE_KEY")
new_model = "Llama-3-HarryPotter"
max_seq_length = 2048
base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
chat_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""
def formatting_prompts_func(examples):
    instruction = ""
    inputs = examples["question"]
    outputs = examples["answer"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = chat_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
train_dataset = load_dataset('json', data_files='train.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='val.jsonl', split='train')
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)
compute_dtype = getattr(torch, "bfloat16")
# 2. Load Llama3 model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False,
    trust_remote_code=True
)
model.config.use_cache = False  # keep caching off while gradient checkpointing is enabled
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
# 3 Before training
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print("Before training\n")
generate_text("<question>: What are the key differences between Python and C++?\n<answer>: ")
# 4. Do model patching and add fast LoRA weights and training
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
# Monitoring the training run with Weights & Biases
wandb.login(key = "YOUR_WANDB_KEY")
run = wandb.init(project='Fine tuning of LLAMA3 8B', job_type="training", anonymous="allow")
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=4,
        evaluation_strategy="steps",
        do_eval=True,
        eval_steps=25,
        save_steps=50,
        logging_steps=25,
        max_steps=100,
        output_dir="outputs",
        optim="adamw_8bit",
        learning_rate=2e-4,
        weight_decay=0.001,
        bf16=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="wandb"
    ),
)
trainer.train()
# 5. After training
print("\n ######## \nAfter training\n")
generate_text("<question>: What are the key differences between Python and C++?\n<answer>: ")
# 6. Save the model
output_dir = "lora_model"
hf_model_repo = "fine-tuned_llama3-8b"
trainer.save_model(output_dir)
output_dir = os.path.join(output_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
tokenizer.push_to_hub(hf_model_repo)
Once the training process finishes, Weights & Biases gives you useful feedback on your run: training and validation loss, runtime, and samples and steps per second.
wandb: \ 0.024 MB of 0.024 MB uploaded
wandb: Run history:
wandb: eval/loss ▁█▆▇
wandb: eval/runtime ▁█▄▃
wandb: eval/samples_per_second █▁▅▆
wandb: eval/steps_per_second █▁▁█
wandb: train/epoch ▁▁▃▃▆▆███
wandb: train/global_step ▁▁▃▃▆▆███
wandb: train/grad_norm ▁▃█▆
wandb: train/learning_rate ▁▁▁▁
wandb: train/loss █▁▆█
wandb:
wandb: Run summary:
wandb: eval/loss 2.37321
wandb: eval/runtime 1746.9307
wandb: eval/samples_per_second 7.779
wandb: eval/steps_per_second 0.487
wandb: total_flos 941803809251328.0
wandb: train/epoch 0.02944
wandb: train/global_step 100
wandb: train/grad_norm 7.31785
wandb: train/learning_rate 0.0002
wandb: train/loss 2.3481
wandb: train_loss 2.26518
wandb: train_runtime 7850.469
wandb: train_samples_per_second 0.051
wandb: train_steps_per_second 0.013
Our model is now fine-tuned. It took nearly two hours and eleven minutes for 100 steps on the NVIDIA Jetson AGX Orin with 64GB of shared memory.
We can check the project on Weights & Biases web interface.
Here are some interesting metrics to analyze. The graph below shows the loss over time for the validation data.
Unfortunately, the latest weights are not necessarily the best ones. To address this, you can implement an early-stopping mechanism during fine-tuning, which regularly evaluates the model on the validation set (as I did with the eval_steps parameter) and stops once the validation loss stops improving; a sketch follows below.
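Here is a minimal sketch using the Hugging Face EarlyStoppingCallback with the same SFTTrainer setup as above; note that load_best_model_at_end, metric_for_best_model, and a save schedule aligned with the evaluation steps are required for it to work.
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    # Stop if eval_loss has not improved for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="steps",
        eval_steps=25,
        save_steps=25,                     # checkpoints must line up with evaluations
        load_best_model_at_end=True,       # restore the best checkpoint when training stops
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        # ...plus the remaining arguments used earlier (batch size, optimizer, learning rate, etc.)
    ),
)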
We may need to make further adjustments to the model's architecture, hyperparameters, or training data to improve its performance. Note that our fine-tuning pipeline can still be improved in several ways.
Once we have our fine-tuned weights, we can merge the adapter into the base model, save the result to a new directory, and push it to the Hugging Face Hub.
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
# Load the fine-tuned model
output_dir = "./lora_model" # Path where your fine-tuned model is saved
device_map = "auto" # Adjust this according to your device setup
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
)
# Merge LoRA and base model
merged_model = model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
# Load the tokenizer and save it
tokenizer = AutoTokenizer.from_pretrained(output_dir)
tokenizer.save_pretrained("merged_model")
# Optionally, push the merged model to the Hugging Face Hub
hf_model_repo = "fine-tuned_llama3-8b" # Replace with your actual Hugging Face repository name
merged_model.push_to_hub(hf_model_repo)
tokenizer.push_to_hub(hf_model_repo)
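As an optional sanity check, you can reload the merged weights and generate a quick reply; this sketch simply reuses the merged_model directory created above and the prompt format from training:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("merged_model")

prompt = "\n### Instruction:\n\n### Input:\nwhat is a 3d printer\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))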
You can find and download the model in my Hugging Face account.
Inference
Now that our model has been fine-tuned, we can test it by running inference. We are going to use nano_llm, developed by Dustin Franklin, and select MLC LLM as the inference backend. MLC LLM is a machine learning compiler and high-performance deployment engine for large language models.
The following command demonstrates how to run inference:
jetson-containers run \
--env HUGGINGFACE_TOKEN=hf_your_token_here \
$(autotag nano_llm) \
python3 -m nano_llm.chat --api mlc \
--model shakhizat/fine-tuned_llama3-8b \
--prompt "Tell me about George Washington" \
--chat-template llama-3
Replace hf_your_token_here with your Hugging Face access token.
Here are some examples of the output you might see:
Example 1.
06:35:34 | INFO | model 'fine-tuned_llama3-8b', chat template 'llama-3' stop tokens: ['<|end_of_text|>', '<|eot_id|>'] -> [128001, 128009]
>> PROMPT: Tell me about George Washington
He's like a Muggle version of Hermione Granger, but with a touch of charm and charisma.�������������������������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000306834 │
├───────────────┼─────────────┤
│ input_tokens │ 29 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.265837 │
├───────────────┼─────────────┤
│ prefill_rate │ 109.089 │
├───────────────┼─────────────┤
│ decode_time │ 3.56045 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9505 │
└───────────────┴─────────────┘
Example 2.
>> PROMPT: Tell me about Canada
What's the name of a Canadian city?Canada is like a magical land of its own, with cities as enchanting as Diagon Alley. Toronto is its capital city, akin to the Ministry of Magic. �������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000309953 │
├───────────────┼─────────────┤
│ input_tokens │ 28 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.280822 │
├───────────────┼─────────────┤
│ prefill_rate │ 99.7074 │
├───────────────┼─────────────┤
│ decode_time │ 3.56078 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9472 │
└───────────────┴─────────────┘
Example 3.
>> PROMPT: What is a 3d printer
What is a 3d printer? It's like a magical printing press that brings your ideas to life!
### Input
what is a 3d printer
### Response
A magical printing press that brings your ideas to life! ��������������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000302272 │
├───────────────┼─────────────┤
│ input_tokens │ 31 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.28815 │
├───────────────┼─────────────┤
│ prefill_rate │ 107.583 │
├───────────────┼─────────────┤
│ decode_time │ 3.5563 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9925 │
└───────────────┴─────────────┘
You can also run the inference process interactively using the following command:
jetson-containers run \
$(autotag nano_llm) \
python3 -m nano_llm.chat --api mlc \
--model shakhizat/fine-tuned_llama3-8b \
--chat-template llama-3 \
--do-sample \
--temperature 0.1 \
--top-p 0.2 \
--system-prompt "You are a helpful AI assistant. Generate a unique response for each prompt." \
--max-new-tokens 150 \
--min-new-tokens 50
A demo video is also available to provide a visual demonstration of this process.
This allows you to enter multiple prompts and receive responses in a continuous conversation.
I hope you have enjoyed this article! You are now able to fine-tune the Llama 3 8B model on your own datasets!