Fine-tuning is one of the most effective ways to adapt a pre-trained model to a specific task: you continue training the model on a custom dataset so that it performs better on that task.
Today, I'll be fine-tuning the Llama3-8B model on the NVIDIA Jetson AGX Orin Developer Kit using a dataset generated specifically for Harry Potter fans. This dataset will be created using the GPT-4o model.
Preparing the data with the GPT-4o model
We'll leverage a subset of the Simple Questions V2 dataset available on Hugging Face to feed into GPT-4o.
We'll first run a Redis server using Docker. This will enable us to queue and process tasks in the background.
docker run --name local-redis --network=host -d redis redis-server
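To confirm the server is up, you can ping it through the container (an optional check):
docker exec -it local-redis redis-cli ping
# Expected reply: PONG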
RQ or Redis Queue is a simple Python library for queueing jobs and processing them in the background with workers.
Next, we'll install the required Python libraries:
pip install redis
pip install rq
pip install openai
We'll define a prompt template for GPT-4o:
HARRY_POTTER_PROMPT_TEMPLATE = """
You are a die-hard fan of the Harry Potter series. Every question you are asked, you respond with a short, magical reference to the series.
Do not cite the book or chapter.
Answer briefly and enchantingly.
One or two sentences is good enough:
{question}
"""
The template ensures responses are concise and avoid directly citing the book or chapter.
Run the script generate.py to create your question-answer samples.
python3 generate.py
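For reference, here is a minimal sketch of how generate.py and its RQ job function could be structured. The file names (tasks.py, questions.txt), the per-sample files written to ./dataset/, and the generate_sample helper are illustrative assumptions rather than the exact script; the prompt is the HARRY_POTTER_PROMPT_TEMPLATE defined above, which you would paste into tasks.py.
# tasks.py -- job executed by the RQ workers (sketch)
import json
import os
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# paste HARRY_POTTER_PROMPT_TEMPLATE from above here

def generate_sample(question, index):
    """Ask GPT-4o for a Harry Potter style answer and write one JSONL sample."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": HARRY_POTTER_PROMPT_TEMPLATE.format(question=question)}],
    )
    answer = response.choices[0].message.content
    os.makedirs("dataset", exist_ok=True)
    with open(f"dataset/sample_{index}.jsonl", "w") as f:
        f.write(json.dumps({"question": question, "answer": answer}) + "\n")

# generate.py -- enqueue one job per question
from redis import Redis
from rq import Queue
from tasks import generate_sample

queue = Queue(connection=Redis())   # connects to the local-redis container
with open("questions.txt") as f:    # questions exported from the Simple Questions V2 subset
    for i, question in enumerate(f):
        queue.enqueue(generate_sample, question.strip(), i)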
Then, launch a pool of worker processes that will handle the generated jobs.
rq worker-pool -n 10
I spent $10 and almost 30 minutes generating 13,590 question-answer samples. Here is an example of a sample:
{"question": "what are names of towns in japan\n", "answer": "\"Ah, the beauty of magic in Japan, like a Patronus casting light... Tokyo, Kyoto, Osaka, each a spellbinding location in the Muggle world!\" \ud83e\ude84\u2728"}
Combine all the generated samples from the dataset directory into a single file named data.jsonl using the following command:
cat ./dataset/* > data.jsonl
Next, we split the dataset into training and validation datasets:
head -n 10590 data.jsonl > train.jsonl
tail -n 3000 data.jsonl > val.jsonl
This extracts the first 10,590 lines for the training set and the last 3,000 lines for the validation set.
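A quick sanity check that the split adds up:
wc -l data.jsonl train.jsonl val.jsonl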
To ensure the data is formatted correctly, we can use a code snippet like this (ideally in a Jupyter notebook) to view a sample entry:
chat_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""
from datasets import load_dataset

def formatting_prompts_func(examples):
    # Build a plain "Instruction / Input / Output" string for every sample
    instruction = ""
    inputs = examples["question"]
    outputs = examples["answer"]
    texts = []
    for input, output in zip(inputs, outputs):
        text = f"Instruction: {instruction}\nInput: {input}\nOutput: {output}"
        texts.append(text)
    return {"text": texts}

dataset = load_dataset('json', data_files='train.jsonl', split='train')
dataset = dataset.map(formatting_prompts_func, batched=True)
Instruction fine-tuning is a common technique used to fine-tune a base LLM for a specific downstream use-case.
Show the first processed entry:
print(dataset['text'][0])
The output looks like this:
Instruction:
Input: Name the genre of the film odd thomas
Output: It's a muggle tale with shadows of the dark arts and spectral whispers! 🌑👻✨
By following these steps, we'll have a prepared Harry Potter Fanatic dataset ready to be used for fine-tuning the Llama3-8B model on the NVIDIA Jetson AGX Orin Developer Kit.
Fine-Tuning a Llama3-8B Model on the NVIDIA Jetson AGX Orin
In the following, we will load the Llama 3 8B model in 4-bit precision thanks to bitsandbytes and Dustin Franklin's jetson-containers project. We then set the LoRA configuration using PEFT for QLoRA.
jetson-containers run -v /path/on/host:/path/in/container $(autotag bitsandbytes)
Install the necessary Python packages and libraries required for fine-tuning:
pip install peft
pip install trl
pip install wandb
Note that you need to submit a request for access to meta-llama/Meta-Llama-3-8B and be logged in to your Hugging Face account. To download the model once access has been granted, make sure you are logged in to the Hugging Face Hub with a valid Hugging Face token.
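If you are not logged in yet, one way to authenticate is via the command line (the token comes from your Hugging Face account settings):
huggingface-cli login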
Before starting, set the power mode of the NVIDIA Jetson AGX Orin to MAXN so the fine-tuning run gets maximum performance.
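On Jetson devices this is typically done with nvpmodel and jetson_clocks; the mode index below (0 for MAXN on the AGX Orin) is an assumption that can vary between JetPack releases, so verify it with nvpmodel -q first.
sudo nvpmodel -m 0   # switch to MAXN (confirm the index for your JetPack release)
sudo jetson_clocks   # lock the clocks at their maximum
The code below then defines the full training setup for fine-tuning the Llama3-8B model: it loads the tokenizer and datasets, configures 4-bit quantization, and sets the training arguments.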
# 1. Importing and configurations
import os
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig, HfArgumentParser, TrainingArguments, pipeline, logging
from peft import LoraConfig, PeftModel, prepare_model_for_kbit_training, get_peft_model
from datasets import load_dataset
from trl import SFTTrainer
import wandb
import gc
from huggingface_hub import login
login(token="YOUR_HUGGING_FACE_KEY")
new_model = "Llama-3-HarryPotter"
max_seq_length = 2048
base_model = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
tokenizer.padding_side = 'right'
tokenizer.pad_token = tokenizer.eos_token
tokenizer.add_eos_token = True
EOS_TOKEN = tokenizer.eos_token # Must add EOS_TOKEN
chat_prompt = """
### Instruction:
{}
### Input:
{}
### Response:
{}"""
def formatting_prompts_func(examples):
    instruction = ""
    inputs = examples["question"]
    outputs = examples["answer"]
    texts = []
    for input, output in zip(inputs, outputs):
        # Must add EOS_TOKEN, otherwise your generation will go on forever!
        text = chat_prompt.format(instruction, input, output) + EOS_TOKEN
        texts.append(text)
    return {"text": texts}
train_dataset = load_dataset('json', data_files='train.jsonl', split='train')
eval_dataset = load_dataset('json', data_files='val.jsonl', split='train')
train_dataset = train_dataset.map(formatting_prompts_func, batched=True)
eval_dataset = eval_dataset.map(formatting_prompts_func, batched=True)
compute_dtype = getattr(torch, "bfloat16")
# 2. Load Llama3 model
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=compute_dtype,
    bnb_4bit_use_double_quant=True,
)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    quantization_config=bnb_config,
    device_map="auto",
    use_cache=False,
    trust_remote_code=True
)
model.config.use_cache = False  # keep caching off while gradient checkpointing is enabled
model.config.pretraining_tp = 1
model.gradient_checkpointing_enable()
model = prepare_model_for_kbit_training(model)
# 3 Before training
def generate_text(text):
    inputs = tokenizer(text, return_tensors="pt").to("cuda:0")
    outputs = model.generate(**inputs, max_new_tokens=20)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))

print("Before training\n")
generate_text("<question>: What are the key differences between Python and C++?\n<answer>: ")
# 4. Do model patching and add fast LoRA weights and training
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
    target_modules=['up_proj', 'down_proj', 'gate_proj', 'k_proj', 'q_proj', 'v_proj', 'o_proj']
)
model = get_peft_model(model, peft_config)
# Monitoring the training run with Weights & Biases
wandb.login(key = "YOUR_WANDB_KEY")
run = wandb.init(project='Fine tuning of LLAMA3 8B', job_type="training", anonymous="allow")
trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    args=TrainingArguments(
        per_device_train_batch_size=1,
        per_device_eval_batch_size=16,
        gradient_accumulation_steps=4,
        evaluation_strategy="steps",
        do_eval=True,
        eval_steps=25,
        save_steps=50,
        logging_steps=25,
        max_steps=100,
        output_dir="outputs",
        optim="adamw_8bit",
        learning_rate=2e-4,
        weight_decay=0.001,
        bf16=True,
        max_grad_norm=0.3,
        warmup_ratio=0.03,
        group_by_length=True,
        lr_scheduler_type="constant",
        report_to="wandb"
    ),
)
trainer.train()
# 5. After training
print("\n ######## \nAfter training\n")
generate_text("<question>: What are the key differences between Python and C++?\n<answer>: ")
# 6. Save the model
output_dir = "lora_model"
hf_model_repo = "fine-tuned_llama3-8b"
trainer.save_model(output_dir)
output_dir = os.path.join(output_dir, "final_checkpoint")
trainer.model.save_pretrained(output_dir)
tokenizer.save_pretrained(output_dir)
tokenizer.push_to_hub(hf_model_repo)
Once the training process finishes, Weights & Biases gives you useful feedback on your run: training and validation loss, runtime, and samples and steps per second.
wandb: \ 0.024 MB of 0.024 MB uploaded
wandb: Run history:
wandb: eval/loss ▁█▆▇
wandb: eval/runtime ▁█▄▃
wandb: eval/samples_per_second █▁▅▆
wandb: eval/steps_per_second █▁▁█
wandb: train/epoch ▁▁▃▃▆▆███
wandb: train/global_step ▁▁▃▃▆▆███
wandb: train/grad_norm ▁▃█▆
wandb: train/learning_rate ▁▁▁▁
wandb: train/loss █▁▆█
wandb:
wandb: Run summary:
wandb: eval/loss 2.37321
wandb: eval/runtime 1746.9307
wandb: eval/samples_per_second 7.779
wandb: eval/steps_per_second 0.487
wandb: total_flos 941803809251328.0
wandb: train/epoch 0.02944
wandb: train/global_step 100
wandb: train/grad_norm 7.31785
wandb: train/learning_rate 0.0002
wandb: train/loss 2.3481
wandb: train_loss 2.26518
wandb: train_runtime 7850.469
wandb: train_samples_per_second 0.051
wandb: train_steps_per_second 0.013
Our model is now fine-tuned. It took nearly two hours and eleven minutes for 100 steps on the NVIDIA Jetson AGX Orin with 64GB of shared memory.
We can check the project on Weights & Biases web interface.
Here are some interesting metrics to analyze. The graph below shows the loss over time for the validation data.
Unfortunately, the latest weights are not necessarily the best ones. To address this, you can implement an early-stopping mechanism during fine-tuning, which regularly evaluates the model on the validation set (as I did with the eval_steps parameter) and stops once the validation loss stops improving; a sketch follows below.
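Here is a minimal sketch using the Hugging Face EarlyStoppingCallback with the same SFTTrainer setup as above; note that load_best_model_at_end, metric_for_best_model, and a save schedule aligned with the evaluation steps are required for it to work.
from transformers import EarlyStoppingCallback

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    dataset_text_field="text",
    peft_config=peft_config,
    max_seq_length=max_seq_length,
    tokenizer=tokenizer,
    # Stop if eval_loss has not improved for 3 consecutive evaluations
    callbacks=[EarlyStoppingCallback(early_stopping_patience=3)],
    args=TrainingArguments(
        output_dir="outputs",
        evaluation_strategy="steps",
        eval_steps=25,
        save_steps=25,                     # checkpoints must line up with evaluations
        load_best_model_at_end=True,       # restore the best checkpoint when training stops
        metric_for_best_model="eval_loss",
        greater_is_better=False,
        # ...plus the remaining arguments used earlier (batch size, optimizer, learning rate, etc.)
    ),
)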
We may need to make further adjustments to the model's architecture, hyperparameters, or training data to improve its performance. Note that our fine-tuning pipeline can still be improved in several ways.
Once we have our fine-tuned weights, we can merge the adapter into the base model, save the result to a new directory, and push it to the Hugging Face Hub.
import torch
from transformers import AutoTokenizer
from peft import AutoPeftModelForCausalLM
# Load the fine-tuned model
output_dir = "./lora_model" # Path where your fine-tuned model is saved
device_map = "auto" # Adjust this according to your device setup
model = AutoPeftModelForCausalLM.from_pretrained(
    output_dir,
    low_cpu_mem_usage=True,
    return_dict=True,
    torch_dtype=torch.bfloat16,
    device_map=device_map,
)
# Merge LoRA and base model
merged_model = model.merge_and_unload()
# Save the merged model
merged_model.save_pretrained("merged_model", safe_serialization=True)
# Load the tokenizer and save it
tokenizer = AutoTokenizer.from_pretrained(output_dir)
tokenizer.save_pretrained("merged_model")
# Optionally, push the merged model to the Hugging Face Hub
hf_model_repo = "fine-tuned_llama3-8b" # Replace with your actual Hugging Face repository name
merged_model.push_to_hub(hf_model_repo)
tokenizer.push_to_hub(hf_model_repo)
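As an optional sanity check, you can reload the merged weights and generate a quick reply; this sketch simply reuses the merged_model directory created above and the prompt format from training:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "merged_model",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("merged_model")

prompt = "\n### Instruction:\n\n### Input:\nwhat is a 3d printer\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))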
You can find and download the model in my Hugging Face account.
Inference
Now that our model has been fine-tuned, we can test it by running inference. We are going to use nano_llm, developed by Dustin Franklin, and select MLC LLM as the inference backend. MLC LLM is a machine learning compiler and high-performance deployment engine for large language models.
The following command demonstrates how to run inference:
jetson-containers run \
--env HUGGINGFACE_TOKEN=hf_your_token_here \
$(autotag nano_llm) \
python3 -m nano_llm.chat --api mlc \
--model shakhizat/fine-tuned_llama3-8b \
--prompt "Tell me about George Washington" \
--chat-template llama-3
Replace hf_your_token_here with your Hugging Face access token.
Here are some examples of the output you might see:
Example 1.
06:35:34 | INFO | model 'fine-tuned_llama3-8b', chat template 'llama-3' stop tokens: ['<|end_of_text|>', '<|eot_id|>'] -> [128001, 128009]
>> PROMPT: Tell me about George Washington
He's like a Muggle version of Hermione Granger, but with a touch of charm and charisma.�������������������������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000306834 │
├───────────────┼─────────────┤
│ input_tokens │ 29 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.265837 │
├───────────────┼─────────────┤
│ prefill_rate │ 109.089 │
├───────────────┼─────────────┤
│ decode_time │ 3.56045 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9505 │
└───────────────┴─────────────┘
Example 2.
>> PROMPT: Tell me about Canada
What's the name of a Canadian city?Canada is like a magical land of its own, with cities as enchanting as Diagon Alley. Toronto is its capital city, akin to the Ministry of Magic. �������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000309953 │
├───────────────┼─────────────┤
│ input_tokens │ 28 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.280822 │
├───────────────┼─────────────┤
│ prefill_rate │ 99.7074 │
├───────────────┼─────────────┤
│ decode_time │ 3.56078 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9472 │
└───────────────┴─────────────┘
Example 3.
>> PROMPT: What is a 3d printer
What is a 3d printer? It's like a magical printing press that brings your ideas to life!
### Input
what is a 3d printer
### Response
A magical printing press that brings your ideas to life! ��������������������������������<|end_of_text|>
┌───────────────┬─────────────┐
│ embed_time │ 0.000302272 │
├───────────────┼─────────────┤
│ input_tokens │ 31 │
├───────────────┼─────────────┤
│ output_tokens │ 128 │
├───────────────┼─────────────┤
│ prefill_time │ 0.28815 │
├───────────────┼─────────────┤
│ prefill_rate │ 107.583 │
├───────────────┼─────────────┤
│ decode_time │ 3.5563 │
├───────────────┼─────────────┤
│ decode_rate │ 35.9925 │
└───────────────┴─────────────┘
You can also run the inference process interactively using the following command:
jetson-containers run \
$(autotag nano_llm) \
python3 -m nano_llm.chat --api mlc \
--model shakhizat/fine-tuned_llama3-8b \
--chat-template llama-3 \
--do-sample \
--temperature 0.1 \
--top-p 0.2 \
--system-prompt "You are a helpful AI assistant. Generate a unique response for each prompt." \
--max-new-tokens 150 \
--min-new-tokens 50
A demo video is also available to provide a visual demonstration of this process.
This allows you to enter multiple prompts and receive responses in a continuous conversation.
I hope you have enjoyed this article! You are now able to fine-tune the Llama 3 8B model on your own datasets!