Using large language models (LLMs) to enhance data analysis has become a popular application area, and many industry giants are focusing on this field. At this year's Snowflake Summit and Databricks Data+AI Summit, AI was a major theme, and both companies launched AI tools. Their goal is to use large models to improve analytical capabilities and data utilization efficiency in BI scenarios.
According to Alibaba's research paper 'Is GPT-4 a Good Data Analyst', large models can significantly enhance productivity in data analysis scenarios. For the same simple data analysis tasks, GPT-4 takes less than one-tenth of the time required by a human analyst.
AI-leading companies such as Snowflake and Databricks are investing significant effort to enhance the capabilities of large language models in data analysis. Snowflake's Cortex AI pursues ease of use, focusing on two scenarios: Cortex Analyst (ChatBI) and Cortex Search (knowledge base search and Q&A). Meanwhile, Databricks' Mosaic AI offers a complete agent solution built on RAG plus tool calling.
Some companies with strict requirements for data security and confidentiality may run into issues when calling an LLM SaaS API, because the BI data analysis scenario touches the company's most core and sensitive data assets. There is another issue as well: even the best model, such as GPT-4, can struggle in data analysis scenarios. For example, generating SQL from natural language (Text-to-SQL) often produces errors and incorrect syntax, because the model was trained on many SQL dialects and has difficulty consistently producing the exact SQL dialect you want.
Here are some potential methods for improvement:
Enrichment and diversification of the dataset: Ensure that the training dataset includes a wide variety of SQL dialects and query patterns to enhance the model's generalization capabilities.
Error feedback mechanism: Introduce an error feedback mechanism during the model training process, allowing the model to learn from its mistakes and make self-corrections.
Multi-task learning: Train the model using a multi-task learning framework to perform syntax analysis, semantic understanding, and SQL generation simultaneously, improving the model's understanding of SQL language structure.
Ensemble learning: Use ensemble learning methods to combine the predictions of multiple models to reduce the biases and errors of individual models.
Domain adaptation: Fine-tune the model for specific domains or types of databases to meet the needs of specific scenarios.
Interactive learning: Improve the model through interaction with users, allowing them to correct errors generated by the model and using these corrections for further training.
Reinforcement learning: Use reinforcement learning techniques to adjust the model's strategy when generating SQL statements based on the execution results.
Integration of rules: Integrate some hard-coded rules into the model to ensure that the generated SQL statements comply with basic syntax and logic.
Continuous learning: The model needs to continuously learn from the latest data and query patterns to adapt to the changing data environment.
By employing these methods, the performance and accuracy of natural-language-to-SQL generation can be gradually improved. However, this requires extensive research and experimentation to determine the most effective methods, and it may require ongoing maintenance and updates to adapt to new challenges.
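As a concrete flavor of the error-feedback and rule-integration ideas above, here is a minimal sketch of a validate-and-retry loop. It assumes the sqlglot library as an offline Spark SQL syntax checker and a hypothetical generate_sql() helper that wraps the model call; a real system would add execution checks and schema validation on top.
```python
# Minimal sketch: reject generated SQL that does not parse as Spark SQL and
# feed the parser error back to the model for another attempt.
# generate_sql(question, feedback) is a hypothetical wrapper around the LLM call.
import sqlglot
from sqlglot.errors import ParseError

def generate_valid_spark_sql(question, generate_sql, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        sql = generate_sql(question, feedback)
        try:
            sqlglot.parse_one(sql, read="spark")  # raises ParseError on invalid syntax
            return sql
        except ParseError as err:
            feedback = f"The previous SQL failed to parse: {err}. Please fix the syntax."
    raise ValueError("Could not produce syntactically valid Spark SQL")
```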
2. Project Basic Objective
Training a Private Model to Maximize SQL Generation Accuracy: Spark SQL (https://spark.apache.org/sql/) is a general-purpose data analysis SQL language that is widely used in data analysis scenarios. This project provides a detailed introduction to the complete process of model training, evaluation, and deployment on AMD's platform, starting from preparing the dataset, as well as the basic steps to significantly improve the model through fine-tuning. In our real-world scenarios, model accuracy improved by more than 100% after carefully preparing the training SQL dataset, which greatly enhances work efficiency in production.
Intelligent Agents: Data analysis involves many business knowledge questions that need to be answered, and data analysts need an efficient knowledge base to enhance their overall development capabilities. This project also makes some attempts in this direction, deploying models on CPUs to serve a knowledge AI agent.
Model-Driven Data Analysis (ChatBI): In data analysis there is also a category known as metrics query (Text-to-API). In our practical experience, the requirements on the model's capabilities are quite high; the larger the model, the better the performance tends to be. Therefore, this project also deploys a 70B model on the MI300X to accelerate overall inference speed and performance.
3. Fine-tuning CodeLlama-34b using 4 AMD MI210 accelerators
This step fine-tunes the model to teach it how to write Spark SQL.
- Step1: Apply for 4 MI210 accelerators at https://aac.amd.com/
The main steps are:
1. Application: select Pytorch_2_1_2_Rocm6_1
2. Resources: set Number of GPUs to 4
3. Wait for the container to finish launching, then run the rocm-smi command; you should see 4 accelerator cards.
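To confirm the allocation inside the container, the standard ROCm tooling can also list the product names of the visible devices:
```bash
# List the detected accelerators and their product names.
rocm-smi --showproductname
```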
- Step2: Setting up the base implementation environment
Reference documents:
Fine-tuning and inference using a single accelerator
Fine-tuning and inference using multiple accelerators
Install the required dependencies.
NOTE: bitsandbytes uses the rocm6.2_internal_testing branch.
source ~/.bashrc
conda activate py_3.10
# Install Dependencies
pip install --upgrade pip
pip install einops lion_pytorch accelerate
pip install git+https://github.com/ROCm/transformers.git
# Install BitsandBytes
git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm6.2_internal_testing
make hip
python setup.py install
# To leverage the SFTTrainer in TRL for model fine-tuning.
pip install trl
# To leverage PEFT for efficiently adapting pre-trained language models.
pip install peft
- Step3: Check that the required packages can be imported.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer
- Step4: Download the base model
base_model_name = "codellama/CodeLlama-34b-Instruct-hf"
# Load base model to GPU memory.
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True, device_map="auto")
# Load tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
NOTE: Add the device_map parameter to the base model configuration to use multiple accelerators.
- Step5: Prepare the fine-tuning dataset
I have uploaded my fine-tuning dataset to https://huggingface.co/datasets/xinchun/spider. It is a carefully prepared fine-tuning dataset: in order to teach the model Spark SQL syntax, all SQL in the dataset is written in Spark SQL. A dataset example is shown below (the instruction provides a table schema and asks for Spark SQL, the input is the natural-language question, and the output is the target Spark SQL):
{
"instruction": "请根据提供的表结构,生成满足需求的SPARK SQL。表结构是:\n CREATE TABLE vpsll.visit_sources (\n visit_id BIGINT COMMENT '访问ID',\n user_id BIGINT COMMENT '用户ID',\n visit_time TIMESTAMP COMMENT '访问时间',\n visit_source STRING COMMENT '访问来源',\n referrer_url STRING COMMENT '来源页面URL',\n search_keywords ARRAY<STRING> COMMENT '搜索关键词',\n landing_page STRING COMMENT '着陆页面'\n)\nCOMMENT '访问来源表'\nPARTITIONED BY (dt STRING COMMENT 'dt分区字段,数据格式是YYYYMMDD');",
"input": "我想找出使用特定关键词(例如'搜索词')进行搜索的访问次数。",
"output": "SELECT COUNT(*) FROM vpsll.visit_sources WHERE array_contains(search_keywords, '搜索词') AND dt = '${dt}';"
}
This dataset contains about 1,600 SQL items, including many private UDF functions. After fine-tuning, the model can output Spark SQL. I also tried adding more SQL items to further improve model accuracy.
# Dataset for fine-tuning.
training_dataset_name = "xinchun/spider"
training_dataset = load_dataset(training_dataset_name, data_files = ['data/train_sql.json', 'data/train_udf.json'], split = "train")
# Check the data.
print(training_dataset)
Dataset({
features: ['instruction', 'input', 'output'],
num_rows: 2454
})
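Because the quality of this dataset directly determines the fine-tuned model's accuracy, it can be worth checking that every target SQL at least parses as Spark SQL before training. A minimal sketch, assuming the sqlglot library as an offline syntax checker (not part of the original pipeline):
```python
# Minimal sketch: verify that every target SQL in the training set parses as Spark SQL.
import sqlglot
from sqlglot.errors import ParseError

bad_rows = []
for i, row in enumerate(training_dataset):
    # Replace templating placeholders defensively so they cannot break parsing.
    sql = row["output"].replace("${dt}", "20240101")
    try:
        sqlglot.parse_one(sql, read="spark")
    except ParseError:
        bad_rows.append(i)

print(f"{len(bad_rows)} of {len(training_dataset)} samples failed to parse: {bad_rows[:10]}")
```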
- Step6: Configure fine-tuning parameters
To set up the SFTTrainer parameters, set per_device_train_batch_size = 2 to reduce GPU memory usage and avoid out-of-memory errors.
# Training parameters for SFTTrainer.
training_arguments = TrainingArguments(
output_dir = "./results",
num_train_epochs = 1,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 1,
optim = "paged_adamw_32bit",
save_steps = 50,
logging_steps = 50,
learning_rate = 4e-5,
weight_decay = 0.001,
fp16=False,
bf16=False,
max_grad_norm = 0.3,
max_steps = -1,
warmup_ratio = 0.03,
group_by_length = True,
lr_scheduler_type = "constant",
report_to = "tensorboard"
)
# Configure LoRA using the following code snippet.
peft_config = LoraConfig(
lora_alpha = 16,
lora_dropout = 0.1,
r = 64,
bias = "none",
task_type = "CAUSAL_LM"
)
# View the number of trainable parameters.
from peft import get_peft_model
peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()
Initialize the SFTTrainer with a PEFT LoRA configuration and run the trainer.
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts
# Initialize an SFT trainer.
sft_trainer = SFTTrainer(
model = base_model,
train_dataset = training_dataset,
peft_config = peft_config,
formatting_func=formatting_prompts_func,
tokenizer = tokenizer,
args = training_arguments
)
See also: https://huggingface.co/docs/trl/sft_trainer
- Step7: Fine-tune the model
# Run the trainer.
sft_trainer.train()
Fine-tuning finished in 33 minutes in my case.
100%|█████████████████████████████████████| 1227/1227 [33:29<00:00, 1.64s/it]
TrainOutput(global_step=1227, training_loss=0.2438461522115569, metrics={'train_runtime': 2009.902, 'train_samples_per_second': 1.221, 'train_steps_per_second': 0.61, 'train_loss': 0.2438461522115569, 'epoch': 1.0})
All 4 accelerator cards show 100% GPU usage.
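If you want to watch utilization while training runs, rocm-smi can simply be polled from a second terminal; for example:
```bash
# Refresh GPU utilization and VRAM usage every 2 seconds during fine-tuning.
watch -n 2 rocm-smi --showuse --showmemuse
```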
- Step8: Save the fine-tuned model
adapter_name = "CodeLlama-34b-Instruct-spark-sql-adapter"
sft_trainer.model.save_pretrained(adapter_name)
The saved PEFT adapter looks like this on my AAC system:
ls -l CodeLlama-34b-Instruct-spark-sql-adapter
total 308456
-rw-rw-r-- 1 aac aac 5109 Jul 24 12:47 README.md
-rw-rw-r-- 1 aac aac 660 Jul 24 12:47 adapter_config.json
-rw-rw-r-- 1 aac aac 314598968 Jul 24 12:47 adapter_model.safetensors
- Step9: Merge the base model and the adapter
from peft import PeftModel
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Merge adapter
model = new_model.merge_and_unload()
# Save the merged model locally.
model.save_pretrained("CodeLlama-34b-Instruct-spark-sql")
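Before moving on to quantization, a quick sanity check on the merged model is useful. The sketch below reuses the tokenizer loaded earlier and a prompt in the same "### Question / ### Answer" format used for fine-tuning; the table schema in the prompt is a hypothetical example:
```python
# Quick sanity check: ask the merged model for a simple Spark SQL query.
prompt = ("### Question: Generate Spark SQL for this table: "
          "CREATE TABLE t_orders (order_id BIGINT, amount DOUBLE, dt STRING). "
          "Query the total order amount per day.\n ### Answer: ")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```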
4. Model quantization using GPTQ
To improve inference token speed and fit the model on one GPU, I quantized the model from FP16 to INT4 and INT8.
Reference document: Model quantization techniques
Step1: Installing AutoGPTQ
Install AutoGPTQ from source for the appropriate ROCm version (6.1):
# Clone the source code.
git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ
# Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device.
PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install .
# Install other required modules
pip install optimum
Step2: Using GPTQ with Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
base_model_name = "CodeLlama-34b-Instruct-spark-sql"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
quantization_config=gptq_config)
quantized_model.save_pretrained("CodeLlama-34b-Instruct-gptq-INT4")
tokenizer.save_pretrained("CodeLlama-34b-Instruct-gptq-INT4")
After a few minutes, quantization finishes and the quantized model is saved.
I also tried quantizing a Llama-3 70B model, but it failed with HIP out of memory.
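To confirm that the INT4 checkpoint still loads and generates, a quick smoke test with Transformers can look like the sketch below (the prompt is arbitrary):
```python
# Reload the GPTQ INT4 checkpoint and run a short generation as a smoke test.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "CodeLlama-34b-Instruct-gptq-INT4"
tok = AutoTokenizer.from_pretrained(quant_path)
quant_model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
inputs = tok("### Question: Write Spark SQL that counts the rows of table t.\n ### Answer: ",
             return_tensors="pt").to(quant_model.device)
print(tok.decode(quant_model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```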
NOTE: Running vLLM inference with this GPTQ-quantized model may lead to an error: ValueError: gptq_marlin quantization is currently not supported in ROCm.
5. Inference using vLLM on multiple AMD MI210 accelerators
vLLM is a high-performance, memory-efficient serving engine for large language models (LLMs). It leverages PagedAttention and continuous batching techniques to rapidly process LLM requests. PagedAttention optimizes memory utilization by partitioning the Key-Value (KV) cache into manageable blocks. The KV cache stores previously computed keys and values, enabling the model to focus on calculating attention solely for the current token. These blocks are subsequently managed through a lookup table, akin to memory page handling in operating systems.
In my case, I built and installed vLLM from source.
source ~/.bashrc
conda activate py_3.10
1. Install PyTorch with ROCm 6.0
https://pytorch.org/get-started/locally/
pip install --upgrade pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
2. Install Triton flash attention for ROCm
pip install ninja==1.10.2.4
pip install triton
NOTE: This is already installed in the AAC environment, so you can skip this step.
3. Build vLLM from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.5.2
pip install -U -r requirements-rocm.txt
python setup.py develop
When the build finishes, the output ends with:
Using /opt/conda/envs/py_3.10/lib/python3.10/site-packages
Finished processing dependencies for vllm==0.5.2+rocm614
4. Model inference and serving
NOTE: On ROCm, vLLM only supports gptq and squeezellm quantization.
see: https://github.com/vllm-project/vllm/blob/main/vllm/config.py
rocm_supported_quantization = ["gptq", "squeezellm"]
A single GPU card cannot serve the 34B FP16 model; it leads to HIP out of memory:
INFO 07-26 06:59:32 model_runner.py:266] Loading model weights took 62.8569 GB
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 8.00 MiB is free. Of the allocated memory 63.67 GiB is allocated by PyTorch, and 16.10 MiB is reserved by PyTorch but unallocated.
5. Multi-GPU inference
Step1: Modify the vLLM source code to resolve the following error:
ImportError: cannot import name 'default_dump_dir' from 'triton.runtime.cache' (/opt/conda/envs/py_3.10/lib/python3.10/site-packages/triton/runtime/cache.py)
Remove the default_dump_dir import and the other failing imports, and it works fine.
Step2: Use 2 cards to serve the model
# Use 2 GPU cards
export HIP_VISIBLE_DEVICES=0,1
# Change the multiprocessing worker method to spawn
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Serve the model
vllm serve codellama/CodeLlama-34b-Instruct-hf --tensor_parallel_size=2
Step3: Query the model using the OpenAI-style API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "codellama/CodeLlama-34b-Instruct-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please use Spark SQL to query orders table lasted order items"}
]
}'
The result may look like this:
Sure, here's an example of how you can use Spark SQL to query the last order item for each order in the orders table:
```
SELECT
o.order_id,
i.product_id,
i.quantity,
i.unit_price,
i.order_item_id
FROM orders o
LEFT JOIN order_items i
ON o.order_id = i.order_id
WHERE i.order_item_id =
(SELECT MAX(order_item_id)
FROM order_items
WHERE order_id = o.order_id)
```
This query will return the last order item for each order in the orders table. The `LEFT JOIN` is used to join the orders table with the order_items table, and the `WHERE` clause is used to filter the results to only include the last order item for each order. The `MAX` function is used to get the maximum order item ID for each order, and the `SELECT` clause is used to return the fields from the orders table and the order item table.
You can also use window functions to achieve this, for example:
```
SELECT
o.order_id,
i.product_id,
i.quantity,
i.unit_price,
i.order_item_id,
ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY i.order_item_id DESC) AS row_num
FROM orders o
LEFT JOIN order_items i
ON o.order_id = i.order_id
WHERE row_num = 1
```
This will give you the same result as the previous query, but it uses a window function to get the last order item for each order. The `ROW_NUMBER` function is used to assign a row number to each order item, and the `PARTITION BY` clause is used to partition the order items by order ID. The `ORDER BY` clause is used to order the order items by their ID in descending order, so that the last order item for each order is assigned a row number of 1. The `WHERE` clause is then used to filter the results to only include the last order item for each order.
Please note that the above queries are just examples, you may need to adjust them to fit your specific use case and data schema.
Token generation speed is about 18.6 tokens/s:
INFO 07-26 07:47:07 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
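Because vLLM exposes an OpenAI-compatible API, the same request can also be sent from Python with the official openai client. A minimal sketch (the api_key value is only a placeholder, since vLLM does not validate it by default):
```python
# Query the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="codellama/CodeLlama-34b-Instruct-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Please use Spark SQL to query the latest order item of each order in the orders table."},
    ],
)
print(resp.choices[0].message.content)
```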
6. Model Evaluation for ChatBI
DB-GPT (https://github.com/eosphoros-ai/DB-GPT) aims to build infrastructure in the field of large models through multiple technical capabilities such as multi-model management (SMMF), Text2SQL effect optimization, RAG framework and optimization, multi-agent framework collaboration, and AWEL (agent workflow orchestration), which makes building large model applications around data simpler and more convenient.
The project provides many useful tools for developers to build AI applications. I used its sub-project DB-GPT-Hub, a fine-tuning framework, to run the Spider dataset for model evaluation. With it, it is easy to measure the ChatBI (Text-to-SQL) execution accuracy on Spider. The execution accuracy of several models is listed in 'Evaluation LLM For Text-to-SQL'.
After incorporating more SQL datasets, a single fine-tuning round can take many hours. After several rounds of fine-tuning tests and verification, the final model achieved a relatively good result: in terms of accuracy within the DB-GPT project, it reached the best performance, surpassing the results of GPT-4.
In real-world scenarios, because of the number of tables and the complexity of their schemas in the data warehouse (tens of thousands of tables are common), accuracy often does not reach such high levels. In our practical experience, if the scope is limited to around 100 tables, simple queries can achieve a relatively high accuracy rate (about 90%), but for complex queries the accuracy can drop significantly (even below 10%). There is a huge gap to close here, and various engineering methods need to be combined to jointly improve accuracy.
7. Inference Llama-3.1 70B using vLLM on AMD MI300X
You can apply for an AMD MI300X instance at runpod.io.
Step1: Build and install vLLM from source
Refer to the instructions provided earlier.
Step2: Run model inference
vllm serve NousResearch/Meta-Llama-3.1-70B-Instruct --max-model-len 4096
NOTE: Limit max-model-len to 4K to avoid the following error:
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130864). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
For reference, inference speed is about 4.3 tokens/s; no optimizations have been applied yet.
INFO 07-31 10:33:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Step3: Metrics Query (ChatBI, continuous iteration)
I have many examples illustrating how to determine the various intents in the metrics query scenario. The following is one specific instance that asks the model to judge what the user wants to do; once the user's intent is determined, the parameters for the subsequent data retrieval and API calls are constructed. The prompt below (kept in the original Chinese) asks the model to extract the metric names mentioned in a sentence and return only JSON of the form {"指标": [...]} ("指标" means "metrics"), with several few-shot examples.
prompt:
我要提取一段话中的指标,只返回JSON格式的结果。JSON的格式为:{"指标":["",""]}
常用的指标有以下这些:
销售数量(含退拒)、销售额(退拒后)、动销商品数、在售商品数、销售额(含拒退)、销售天数、货量售卖比、销售数量(含退拒)(优惠券分析)、销售数量(退拒后)、销售额(含拒退)(优惠券分析)
这里相应示例:
请给出前年店庆期间,体用部类,唯品快抢活动下,各个商品spu的uv,这个例子指标是“uv”,返回的json为 {"指标":["uv"]}
我要获取最近30天,品牌sn是10027344的品牌新客客户数,这个例子指标是“客户数”,返回的json为 {"指标":["客户数"]}
我想获取昨天,生活超市品类下的各个品牌的销售额,这个例子的指标是“销售额”,返回的json为 {"指标":["销售额"]}
上个月,女装部类下,各个供应商、各个品牌的销售额,uv和转换率,这个例子指标是“销售额”、“uv”、“转换率”,返回的json为 {"指标":["销售额","uv","转换率"]}
常用的指标有:销售额
请提取下面这句话的指标,只返回JSON格式结果
input (asking to compare the Skechers brand's women's shoe sales for June of last year versus June of this year):
帮我查出去年6月和今年6月,斯凯奇品牌名称的女鞋销售额和对比
expected output:
{
"指标":["销售额(退拒后)"、"销售额(含拒退)"]
}
After obtaining the JSON structure returned by the model, it is converted into a SQL query, and the query's result data is returned. There are still many complex steps involved, and we are continuously iterating to improve query accuracy.
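To give a flavor of that conversion, the sketch below turns the model's metric-extraction JSON into a simple aggregation query. The metric-to-column mapping and the table name are hypothetical placeholders; the real pipeline also handles dimensions, time ranges, and API parameters.
```python
# Minimal sketch: convert the model's metric-extraction JSON into a Spark SQL query.
# METRIC_COLUMNS and the table name are hypothetical placeholders.
import json

METRIC_COLUMNS = {
    "销售额(退拒后)": "net_sales_amount",
    "销售额(含拒退)": "gross_sales_amount",
    "uv": "uv",
}

def build_metrics_query(model_output: str, table: str = "dm.metrics_daily") -> str:
    metrics = json.loads(model_output)["指标"]
    columns = [f"SUM({METRIC_COLUMNS[m]}) AS `{m}`" for m in metrics if m in METRIC_COLUMNS]
    return f"SELECT dt, {', '.join(columns)} FROM {table} GROUP BY dt"

print(build_metrics_query('{"指标": ["销售额(退拒后)", "uv"]}'))
```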
We have also tested how reliably different models produce this JSON and constructed many samples for correctness assessment. On the MI300X, we also evaluated the performance of Qwen2-72B-Instruct.
We use an embedding model ('bge-large-zh') and the Milvus 2 vector database for storing and matching the various dimensions and metrics, as well as for storing the various analytical methods. The entire process is quite complex, and since this article focuses on inference on the AMD MI300X, it is not described in detail.
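Only to illustrate the matching step, here is a minimal sketch assuming FlagEmbedding for the bge-large-zh model and pymilvus for Milvus 2; the collection name and output field are hypothetical, and the production pipeline involves considerably more logic.
```python
# Minimal sketch: match a user phrase against metric names stored in Milvus.
from FlagEmbedding import FlagModel
from pymilvus import MilvusClient

embedder = FlagModel("BAAI/bge-large-zh-v1.5")
client = MilvusClient(uri="http://localhost:19530")

def match_metric(phrase: str, top_k: int = 3):
    vector = embedder.encode([phrase])[0]
    hits = client.search(
        collection_name="metric_embeddings",  # hypothetical collection of metric names
        data=[vector.tolist()],
        limit=top_k,
        output_fields=["metric_name"],
    )
    return [hit["entity"]["metric_name"] for hit in hits[0]]

print(match_metric("销售额"))
```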
8. Inference model using llama.cpp and embedding on CPU (AMD EPYC 9654)
In my complete scenario, I also used the CPU to perform some model inference. For models around 13B, the CPU is already fully capable of meeting the need. When GPUs are in short supply, the AMD EPYC 9654, with its 96 cores, strong performance, and high memory bandwidth, makes the CPU a very good complementary alternative.
In my test case from last year, the AMD EPYC 9654 with 128 threads improved 70B model performance from 3.77 to 4.01 tokens per second.
Due to its good performance, I use the EPYC 9654 for inference in many scenarios. In the scenario covered by this article, the EPYC 9654 was ultimately used to run the Qwen2-57B-A14B-Instruct model to answer questions in the knowledge base agent. Inference on such a powerful CPU delivers performance that is good enough for production use.
Running qwen2-57b-a14b-instruct-q8_0.gguf with 128 threads, my test case achieves 17.51 tokens per second (57.48 ms per token).
Performance test case:
./llama-cli -m /data/models/qwen2-57b-a14b-instruct-q8_0.gguf \
-n 512 -co -i \
--in-prefix "<|im_start|>user\n" \
--in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
-ngl 28 -fa --temp 0.8 -p "唯品会是一家什么样的公司?"
CPU usage with 128 threads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
73268 apps 20 0 58.3g 56.9g 56.8g R 12800 7.5 22:14.42 llama-cli
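For serving the knowledge-base agent rather than interactive testing with llama-cli, llama.cpp also ships an HTTP server binary that exposes an OpenAI-compatible API. A sketch of serving the same GGUF model (the port, context size, and thread count here are arbitrary choices):
```bash
# Serve the GGUF model over an OpenAI-compatible HTTP API with llama.cpp's server binary.
./llama-server -m /data/models/qwen2-57b-a14b-instruct-q8_0.gguf \
  -t 128 -c 4096 --host 0.0.0.0 --port 8080
```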
The agent uses a BGE model for knowledge embedding and Milvus 2 for vector storage and retrieval. Both work exceptionally well on the EPYC 9654, delivering very high performance.
9. Deploy Model to Production
We deployed the fine-tuned model online, providing a web interface for users that includes Text-to-SQL and intelligent agents. The generated SQL statements are checked by a syntax tool to ensure they are correct Spark SQL. Below are some screenshots of the system.
Text-to-SQL (ChatBI) Input
Query the daily number of user visits, ensuring to exclude duplicates in the calculation.
SQL Output
SELECT dt, COUNT(DISTINCT user_id) AS user_visits FROM vipdws.dws_log_pageview_merchandise_ds GROUP BY dt
The UI has an "执行" (Execute) button, which allows running the SQL directly and returns the correct results.
Intelligent Agent Input
怎么解析JSON数组 (How do I parse a JSON array?)
Agent Output
SELECT from_json(json_array, 'array<struct<name:string, age:int>>') AS parsed_array FROM json_data;
10. Overall System Architecture
This is my overall system architecture, which includes the underlying hardware (GPUs and CPUs), acceleration libraries, AI frameworks, the model layer, and the application layer. Originally, the entire architecture was based on NVIDIA's ecosystem. In this project, it was completely migrated to AMD's ecosystem and achieved the expected results. Some work is still ongoing; for example, the metrics query (Text2API) is based on a 70B model and will be gradually verified and migrated later.
On aac.amd.com, using multiple MI210 GPUs, I have essentially completed all the work I need. Although some of the supported optimization techniques and methods are not yet as comprehensive as NVIDIA's solutions, I believe that with time, the gradual improvement of ROCm, and the growth of the AMD AI ecosystem, AMD's products will become more and more competitive in AI scenarios.