Using large language models (LLMs) to enhance data analysis has become a popular application area, and many industry giants are focusing on this field. At this year's Snowflake Summit and Databricks Data+AI Summit, AI was a major theme, and both companies launched AI tools. Their goal is to use large models to improve analytical capabilities and data utilization efficiency in BI scenarios.
According to Alibaba's research paper 'Is GPT-4 a Good Data Analyst', large models can significantly enhance productivity in data analysis scenarios. For the same simple data analysis tasks, GPT-4 takes less than one-tenth of the time required by a human analyst.
AI-leading companies such as Snowflake and Databricks are investing significant effort to enhance the capabilities of large language models in data analysis. Snowflake's Cortex AI pursues ease of use, focusing on two scenarios: Cortex Analyst (ChatBI) and Cortex Search (knowledge base search and Q&A). Meanwhile, Databricks' Mosaic AI offers a complete agent solution built on RAG plus tool calling.
Some companies with strict requirements for data security and confidentiality may run into issues when calling an LLM SaaS API, because the BI data analysis scenario touches the company's most core and sensitive data assets. There is another issue as well: even the best model, such as GPT-4, can struggle in data analysis scenarios. For example, generating SQL from natural language (Text-to-SQL) often produces errors and incorrect syntax, because the model was trained on many SQL dialects and has difficulty consistently producing the exact SQL dialect you want.
Here are some potential methods for improvement:
Enrichment and diversification of the dataset: Ensure that the training dataset includes a wide variety of SQL dialects and query patterns to enhance the model's generalization capabilities.
Error feedback mechanism: Introduce an error feedback mechanism during the model training process, allowing the model to learn from its mistakes and make self-corrections.
Multi-task learning: Train the model using a multi-task learning framework to perform syntax analysis, semantic understanding, and SQL generation simultaneously, improving the model's understanding of SQL language structure.
Ensemble learning: Use ensemble learning methods to combine the predictions of multiple models to reduce the biases and errors of individual models.
Domain adaptation: Fine-tune the model for specific domains or types of databases to meet the needs of specific scenarios.
Interactive learning: Improve the model through interaction with users, allowing them to correct errors generated by the model and using these corrections for further training.
Reinforcement learning: Use reinforcement learning techniques to adjust the model's strategy when generating SQL statements based on the execution results.
Integration of rules: Integrate some hard-coded rules into the model to ensure that the generated SQL statements comply with basic syntax and logic.
Continuous learning: The model needs to continuously learn from the latest data and query patterns to adapt to the changing data environment.
By employing these methods, the performance and accuracy of natural-language-to-SQL generation can be gradually improved. However, this requires extensive research and experimentation to determine the most effective methods, and it may require ongoing maintenance and updates to adapt to new challenges.
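As a concrete flavor of the error-feedback and rule-integration ideas above, here is a minimal sketch of a validate-and-retry loop. It assumes the sqlglot library as an offline Spark SQL syntax checker and a hypothetical generate_sql() helper that wraps the model call; a real system would add execution checks and schema validation on top.
```python
# Minimal sketch: reject generated SQL that does not parse as Spark SQL and
# feed the parser error back to the model for another attempt.
# generate_sql(question, feedback) is a hypothetical wrapper around the LLM call.
import sqlglot
from sqlglot.errors import ParseError

def generate_valid_spark_sql(question, generate_sql, max_retries=3):
    feedback = None
    for _ in range(max_retries):
        sql = generate_sql(question, feedback)
        try:
            sqlglot.parse_one(sql, read="spark")  # raises ParseError on invalid syntax
            return sql
        except ParseError as err:
            feedback = f"The previous SQL failed to parse: {err}. Please fix the syntax."
    raise ValueError("Could not produce syntactically valid Spark SQL")
```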
2. Project Basic Objective
Training a Private Model to Maximize SQL Generation Accuracy: Spark SQL (https://spark.apache.org/sql/) is a general-purpose data analysis SQL language that is widely used in data analysis scenarios. This project provides a detailed introduction to the complete process of model training, evaluation, and deployment on AMD's platform, starting from preparing the dataset, as well as the basic steps to significantly improve the model through fine-tuning. In our real-world scenarios, model accuracy improved by more than 100% after carefully preparing the training SQL dataset, which greatly enhances work efficiency in production.
Intelligent Agents: Data analysis involves many business knowledge questions that need to be answered, and data analysts need an efficient knowledge base to enhance their overall development capabilities. This project also makes some attempts in this direction, deploying models on CPUs to serve a knowledge AI agent.
Model-Driven Data Analysis (ChatBI): In data analysis there is also a category known as metrics query (Text-to-API). In our practical experience, the requirements on the model's capabilities are quite high; the larger the model, the better the performance tends to be. Therefore, this project also deploys a 70B model on the MI300X to accelerate overall inference speed and performance.
3. Fine-tuning CodeLlama-34b using 4 AMD MI210 accelerators
This step fine-tunes the model to teach it how to write Spark SQL.
- Step1: Apply for 4 MI210 accelerators at https://aac.amd.com/
The main steps are:
1. Application: select Pytorch_2_1_2_Rocm6_1
2. Resources: set Number of GPUs to 4
3. Wait for the container to finish launching, then run the rocm-smi command; you should see 4 accelerator cards.
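To confirm the allocation inside the container, the standard ROCm tooling can also list the product names of the visible devices:
```bash
# List the detected accelerators and their product names.
rocm-smi --showproductname
```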
- Step2: Setting up the base implementation environment
Reference documents:
Fine-tuning and inference using a single accelerator
Fine-tuning and inference using multiple accelerators
Install the required dependencies.
NOTE: bitsandbytes uses the rocm6.2_internal_testing branch.
source ~/.bashrc
conda activate py_3.10
# Install Dependencies
pip install --upgrade pip
pip install einops lion_pytorch accelerate
pip install git+https://github.com/ROCm/transformers.git
# Install BitsandBytes
git clone --recurse https://github.com/ROCm/bitsandbytes
cd bitsandbytes
git checkout rocm6.2_internal_testing
make hip
python setup.py install
# To leverage the SFTTrainer in TRL for model fine-tuning.
pip install trl
# To leverage PEFT for efficiently adapting pre-trained language models.
pip install peft
- Step3: Check that the required packages can be imported.
import torch
from datasets import load_dataset
from transformers import (
AutoModelForCausalLM,
AutoTokenizer,
TrainingArguments
)
from peft import LoraConfig
from trl import SFTTrainer
- Step4: Download the base model
base_model_name = "codellama/CodeLlama-34b-Instruct-hf"
# Load base model to GPU memory.
base_model = AutoModelForCausalLM.from_pretrained(base_model_name, trust_remote_code = True, device_map="auto")
# Load tokenizer.
tokenizer = AutoTokenizer.from_pretrained(
base_model_name,
trust_remote_code = True)
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = "right"
NOTE: Add the device_map parameter to the base model configuration to use multiple accelerators.
- Step5: Prepare the fine-tuning dataset
I have uploaded my fine-tuning dataset to https://huggingface.co/datasets/xinchun/spider. It is a carefully prepared fine-tuning dataset: in order to teach the model Spark SQL syntax, all SQL in the dataset is written in Spark SQL. A dataset example is shown below (the instruction provides a table schema and asks for Spark SQL, the input is the natural-language question, and the output is the target Spark SQL):
{
"instruction": "请根据提供的表结构,生成满足需求的SPARK SQL。表结构是:\n CREATE TABLE vpsll.visit_sources (\n visit_id BIGINT COMMENT '访问ID',\n user_id BIGINT COMMENT '用户ID',\n visit_time TIMESTAMP COMMENT '访问时间',\n visit_source STRING COMMENT '访问来源',\n referrer_url STRING COMMENT '来源页面URL',\n search_keywords ARRAY<STRING> COMMENT '搜索关键词',\n landing_page STRING COMMENT '着陆页面'\n)\nCOMMENT '访问来源表'\nPARTITIONED BY (dt STRING COMMENT 'dt分区字段,数据格式是YYYYMMDD');",
"input": "我想找出使用特定关键词(例如'搜索词')进行搜索的访问次数。",
"output": "SELECT COUNT(*) FROM vpsll.visit_sources WHERE array_contains(search_keywords, '搜索词') AND dt = '${dt}';"
}
This dataset contains about 1,600 SQL items, including many private UDF functions. After fine-tuning, the model can output Spark SQL. I also tried adding more SQL items to further improve model accuracy.
# Dataset for fine-tuning.
training_dataset_name = "xinchun/spider"
training_dataset = load_dataset(training_dataset_name, data_files = ['data/train_sql.json', 'data/train_udf.json'], split = "train")
# Check the data.
print(training_dataset)
Dataset({
features: ['instruction', 'input', 'output'],
num_rows: 2454
})
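Because the quality of this dataset directly determines the fine-tuned model's accuracy, it can be worth checking that every target SQL at least parses as Spark SQL before training. A minimal sketch, assuming the sqlglot library as an offline syntax checker (not part of the original pipeline):
```python
# Minimal sketch: verify that every target SQL in the training set parses as Spark SQL.
import sqlglot
from sqlglot.errors import ParseError

bad_rows = []
for i, row in enumerate(training_dataset):
    # Replace templating placeholders defensively so they cannot break parsing.
    sql = row["output"].replace("${dt}", "20240101")
    try:
        sqlglot.parse_one(sql, read="spark")
    except ParseError:
        bad_rows.append(i)

print(f"{len(bad_rows)} of {len(training_dataset)} samples failed to parse: {bad_rows[:10]}")
```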
- Step6: Configure fine-tuning parameters
To set up the SFTTrainer parameters, set per_device_train_batch_size = 2 to reduce GPU memory usage and avoid out-of-memory errors.
# Training parameters for SFTTrainer.
training_arguments = TrainingArguments(
output_dir = "./results",
num_train_epochs = 1,
per_device_train_batch_size = 2,
gradient_accumulation_steps = 1,
optim = "paged_adamw_32bit",
save_steps = 50,
logging_steps = 50,
learning_rate = 4e-5,
weight_decay = 0.001,
fp16=False,
bf16=False,
max_grad_norm = 0.3,
max_steps = -1,
warmup_ratio = 0.03,
group_by_length = True,
lr_scheduler_type = "constant",
report_to = "tensorboard"
)
# Configure LoRA using the following code snippet.
peft_config = LoraConfig(
lora_alpha = 16,
lora_dropout = 0.1,
r = 64,
bias = "none",
task_type = "CAUSAL_LM"
)
# View the number of trainable parameters.
from peft import get_peft_model
peft_model = get_peft_model(base_model, peft_config)
peft_model.print_trainable_parameters()
Initialize the SFTTrainer with a PEFT LoRA configuration and run the trainer.
def formatting_prompts_func(example):
    output_texts = []
    for i in range(len(example['instruction'])):
        text = f"### Question: {example['instruction'][i]}\n ### Answer: {example['output'][i]}"
        output_texts.append(text)
    return output_texts
# Initialize an SFT trainer.
sft_trainer = SFTTrainer(
model = base_model,
train_dataset = training_dataset,
peft_config = peft_config,
formatting_func=formatting_prompts_func,
tokenizer = tokenizer,
args = training_arguments
)
See also: https://huggingface.co/docs/trl/sft_trainer
- Step7: Fine-tune the model
# Run the trainer.
sft_trainer.train()
Fine-tuning finished in 33 minutes in my case.
100%|█████████████████████████████████████| 1227/1227 [33:29<00:00, 1.64s/it]
TrainOutput(global_step=1227, training_loss=0.2438461522115569, metrics={'train_runtime': 2009.902, 'train_samples_per_second': 1.221, 'train_steps_per_second': 0.61, 'train_loss': 0.2438461522115569, 'epoch': 1.0})
All 4 accelerator cards show 100% GPU usage.
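If you want to watch utilization while training runs, rocm-smi can simply be polled from a second terminal; for example:
```bash
# Refresh GPU utilization and VRAM usage every 2 seconds during fine-tuning.
watch -n 2 rocm-smi --showuse --showmemuse
```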
- Step8: Save the fine-tuned model
adapter_name = "CodeLlama-34b-Instruct-spark-sql-adapter"
sft_trainer.model.save_pretrained(adapter_name)
The saved PEFT adapter looks like this on my AAC system:
ls -l CodeLlama-34b-Instruct-spark-sql-adapter
total 308456
-rw-rw-r-- 1 aac aac 5109 Jul 24 12:47 README.md
-rw-rw-r-- 1 aac aac 660 Jul 24 12:47 adapter_config.json
-rw-rw-r-- 1 aac aac 314598968 Jul 24 12:47 adapter_model.safetensors
- Step9: Merge the base model and the adapter
from peft import PeftModel
# Adapt the base model with the adapter
new_model = PeftModel.from_pretrained(base_model, adapter_name)
# Merge adapter
model = new_model.merge_and_unload()
# Save the merged model locally.
model.save_pretrained("CodeLlama-34b-Instruct-spark-sql")
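Before moving on to quantization, a quick sanity check on the merged model is useful. The sketch below reuses the tokenizer loaded earlier and a prompt in the same "### Question / ### Answer" format used for fine-tuning; the table schema in the prompt is a hypothetical example:
```python
# Quick sanity check: ask the merged model for a simple Spark SQL query.
prompt = ("### Question: Generate Spark SQL for this table: "
          "CREATE TABLE t_orders (order_id BIGINT, amount DOUBLE, dt STRING). "
          "Query the total order amount per day.\n ### Answer: ")
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```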
4. Model quantization using GPTQ
To improve inference token speed and fit the model on one GPU, I quantized the model from FP16 to INT4 and INT8.
Reference document: Model quantization techniques
Step1: Installing AutoGPTQ
Install AutoGPTQ from source for the appropriate ROCm version (6.1):
# Clone the source code.
git clone https://github.com/AutoGPTQ/AutoGPTQ.git
cd AutoGPTQ
# Speed up the compilation by specifying PYTORCH_ROCM_ARCH to target device.
PYTORCH_ROCM_ARCH=gfx942 ROCM_VERSION=6.1 pip install .
# Install other required modules
pip install optimum
Step2: Using GPTQ with Hugging Face Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig
base_model_name = "CodeLlama-34b-Instruct-spark-sql"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
quantized_model = AutoModelForCausalLM.from_pretrained(
base_model_name,
device_map="auto",
quantization_config=gptq_config)
quantized_model.save_pretrained("CodeLlama-34b-Instruct-gptq-INT4")
tokenizer.save_pretrained("CodeLlama-34b-Instruct-gptq-INT4")
After a few minutes, quantization finishes and the quantized model is saved.
I also tried quantizing a Llama-3 70B model, but it failed with HIP out of memory.
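To confirm that the INT4 checkpoint still loads and generates, a quick smoke test with Transformers can look like the sketch below (the prompt is arbitrary):
```python
# Reload the GPTQ INT4 checkpoint and run a short generation as a smoke test.
from transformers import AutoModelForCausalLM, AutoTokenizer

quant_path = "CodeLlama-34b-Instruct-gptq-INT4"
tok = AutoTokenizer.from_pretrained(quant_path)
quant_model = AutoModelForCausalLM.from_pretrained(quant_path, device_map="auto")
inputs = tok("### Question: Write Spark SQL that counts the rows of table t.\n ### Answer: ",
             return_tensors="pt").to(quant_model.device)
print(tok.decode(quant_model.generate(**inputs, max_new_tokens=64)[0], skip_special_tokens=True))
```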
NOTE: Running vLLM inference with this GPTQ-quantized model may lead to an error: ValueError: gptq_marlin quantization is currently not supported in ROCm.
5. Inference using vLLM on multiple AMD MI210 accelerators
vLLM is a high-performance, memory-efficient serving engine for large language models (LLMs). It leverages PagedAttention and continuous batching techniques to rapidly process LLM requests. PagedAttention optimizes memory utilization by partitioning the Key-Value (KV) cache into manageable blocks. The KV cache stores previously computed keys and values, enabling the model to focus on calculating attention solely for the current token. These blocks are subsequently managed through a lookup table, akin to memory page handling in operating systems.
In my case, I built and installed vLLM from source.
source ~/.bashrc
conda activate py_3.10
1. Install PyTorch with ROCm 6.0
https://pytorch.org/get-started/locally/
pip install --upgrade pip
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
2. Install Triton flash attention for ROCm
pip install ninja==1.10.2.4
pip install triton
NOTE: This is already installed in the AAC environment, so you can skip this step.
3. Build vLLM from source
git clone https://github.com/vllm-project/vllm.git
cd vllm
git checkout v0.5.2
pip install -U -r requirements-rocm.txt
python setup.py develop
When the build finishes, the output ends with:
Using /opt/conda/envs/py_3.10/lib/python3.10/site-packages
Finished processing dependencies for vllm==0.5.2+rocm614
4. Model inference and serving
NOTE: On ROCm, vLLM only supports gptq and squeezellm quantization.
see: https://github.com/vllm-project/vllm/blob/main/vllm/config.py
rocm_supported_quantization = ["gptq", "squeezellm"]
A single GPU card cannot serve the 34B FP16 model; it leads to HIP out of memory:
INFO 07-26 06:59:32 model_runner.py:266] Loading model weights took 62.8569 GB
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 256.00 MiB. GPU 0 has a total capacity of 63.98 GiB of which 8.00 MiB is free. Of the allocated memory 63.67 GiB is allocated by PyTorch, and 16.10 MiB is reserved by PyTorch but unallocated.
5. Multi-GPU inference
Step1: Modify the vLLM source code to resolve the following error:
ImportError: cannot import name 'default_dump_dir' from 'triton.runtime.cache' (/opt/conda/envs/py_3.10/lib/python3.10/site-packages/triton/runtime/cache.py)
Remove the default_dump_dir import and the other failing imports, and it works fine.
Step2: Use 2 cards to serve the model
# Use 2 GPU cards
export HIP_VISIBLE_DEVICES=0,1
# Change the multiprocessing worker method to spawn
export VLLM_WORKER_MULTIPROC_METHOD=spawn
# Serve the model
vllm serve codellama/CodeLlama-34b-Instruct-hf --tensor_parallel_size=2
Step3: Query the model using the OpenAI-style API
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "codellama/CodeLlama-34b-Instruct-hf",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Please use Spark SQL to query orders table lasted order items"}
]
}'
The result may look like this:
Sure, here's an example of how you can use Spark SQL to query the last order item for each order in the orders table:
```
SELECT
o.order_id,
i.product_id,
i.quantity,
i.unit_price,
i.order_item_id
FROM orders o
LEFT JOIN order_items i
ON o.order_id = i.order_id
WHERE i.order_item_id =
(SELECT MAX(order_item_id)
FROM order_items
WHERE order_id = o.order_id)
```
This query will return the last order item for each order in the orders table. The `LEFT JOIN` is used to join the orders table with the order_items table, and the `WHERE` clause is used to filter the results to only include the last order item for each order. The `MAX` function is used to get the maximum order item ID for each order, and the `SELECT` clause is used to return the fields from the orders table and the order item table.
You can also use window functions to achieve this, for example:
```
SELECT
o.order_id,
i.product_id,
i.quantity,
i.unit_price,
i.order_item_id,
ROW_NUMBER() OVER (PARTITION BY o.order_id ORDER BY i.order_item_id DESC) AS row_num
FROM orders o
LEFT JOIN order_items i
ON o.order_id = i.order_id
WHERE row_num = 1
```
This will give you the same result as the previous query, but it uses a window function to get the last order item for each order. The `ROW_NUMBER` function is used to assign a row number to each order item, and the `PARTITION BY` clause is used to partition the order items by order ID. The `ORDER BY` clause is used to order the order items by their ID in descending order, so that the last order item for each order is assigned a row number of 1. The `WHERE` clause is then used to filter the results to only include the last order item for each order.
Please note that the above queries are just examples, you may need to adjust them to fit your specific use case and data schema.
Token generation speed is about 18.6 tokens/s:
INFO 07-26 07:47:07 metrics.py:295] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 18.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
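Because vLLM exposes an OpenAI-compatible API, the same request can also be sent from Python with the official openai client. A minimal sketch (the api_key value is only a placeholder, since vLLM does not validate it by default):
```python
# Query the vLLM OpenAI-compatible endpoint from Python.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="codellama/CodeLlama-34b-Instruct-hf",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Please use Spark SQL to query the latest order item of each order in the orders table."},
    ],
)
print(resp.choices[0].message.content)
```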
6. Model Evaluation for ChatBI
DB-GPT (https://github.com/eosphoros-ai/DB-GPT) aims to build infrastructure in the field of large models through multiple technical capabilities such as multi-model management (SMMF), Text2SQL effect optimization, RAG framework and optimization, multi-agent framework collaboration, and AWEL (agent workflow orchestration), which makes building large model applications around data simpler and more convenient.
The project provides many useful tools for developers to build AI applications. I used its sub-project DB-GPT-Hub, a fine-tuning framework, to run the Spider dataset for model evaluation. With it, it is easy to measure the ChatBI (Text-to-SQL) execution accuracy on Spider. The execution accuracy of several models is listed in 'Evaluation LLM For Text-to-SQL'.
After incorporating more SQL datasets, a single fine-tuning round can take many hours. After several rounds of fine-tuning tests and verification, the final model achieved a relatively good result: in terms of accuracy within the DB-GPT project, it reached the best performance, surpassing the results of GPT-4.
In real-world scenarios, because of the number of tables and the complexity of their schemas in the data warehouse (tens of thousands of tables are common), accuracy often does not reach such high levels. In our practical experience, if the scope is limited to around 100 tables, simple queries can achieve a relatively high accuracy rate (about 90%), but for complex queries the accuracy can drop significantly (even below 10%). There is a huge gap to close here, and various engineering methods need to be combined to jointly improve accuracy.
7. Inference Llama-3.1 70B using vLLM on AMD MI300X
You can apply for an AMD MI300X instance at runpod.io.
Step1: Build and install vLLM from source
Refer to the instructions provided earlier.
Step2: Run model inference
vllm serve NousResearch/Meta-Llama-3.1-70B-Instruct --max-model-len 4096
NOTE: Limit max-model-len to 4K to avoid the following error:
[rank0]: ValueError: The model's max seq len (131072) is larger than the maximum number of tokens that can be stored in KV cache (130864). Try increasing `gpu_memory_utilization` or decreasing `max_model_len` when initializing the engine.
For reference, inference speed is about 4.3 tokens/s; no optimizations have been applied yet.
INFO 07-31 10:33:27 metrics.py:406] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 4.3 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
Step3: Metrics Query (ChatBI, continuous iteration)
I have many examples illustrating how to determine the various intents in the metrics query scenario. The following is one specific instance that asks the model to judge what the user wants to do; once the user's intent is determined, the parameters for the subsequent data retrieval and API calls are constructed. The prompt below (kept in the original Chinese) asks the model to extract the metric names mentioned in a sentence and return only JSON of the form {"指标": [...]} ("指标" means "metrics"), with several few-shot examples.
prompt:
我要提取一段话中的指标,只返回JSON格式的结果。JSON的格式为:{"指标":["",""]}
常用的指标有以下这些:
销售数量(含退拒)、销售额(退拒后)、动销商品数、在售商品数、销售额(含拒退)、销售天数、货量售卖比、销售数量(含退拒)(优惠券分析)、销售数量(退拒后)、销售额(含拒退)(优惠券分析)
这里相应示例:
请给出前年店庆期间,体用部类,唯品快抢活动下,各个商品spu的uv,这个例子指标是“uv”,返回的json为 {"指标":["uv"]}
我要获取最近30天,品牌sn是10027344的品牌新客客户数,这个例子指标是“客户数”,返回的json为 {"指标":["客户数"]}
我想获取昨天,生活超市品类下的各个品牌的销售额,这个例子的指标是“销售额”,返回的json为 {"指标":["销售额"]}
上个月,女装部类下,各个供应商、各个品牌的销售额,uv和转换率,这个例子指标是“销售额”、“uv”、“转换率”,返回的json为 {"指标":["销售额","uv","转换率"]}
常用的指标有:销售额
请提取下面这句话的指标,只返回JSON格式结果
input (asking to compare the Skechers brand's women's shoe sales for June of last year versus June of this year):
帮我查出去年6月和今年6月,斯凯奇品牌名称的女鞋销售额和对比
expected output:
{
"指标":["销售额(退拒后)"、"销售额(含拒退)"]
}
After obtaining the JSON structure returned by the model, it is converted into a SQL query, and the query's result data is returned. There are still many complex steps involved, and we are continuously iterating to improve query accuracy.
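To give a flavor of that conversion, the sketch below turns the model's metric-extraction JSON into a simple aggregation query. The metric-to-column mapping and the table name are hypothetical placeholders; the real pipeline also handles dimensions, time ranges, and API parameters.
```python
# Minimal sketch: convert the model's metric-extraction JSON into a Spark SQL query.
# METRIC_COLUMNS and the table name are hypothetical placeholders.
import json

METRIC_COLUMNS = {
    "销售额(退拒后)": "net_sales_amount",
    "销售额(含拒退)": "gross_sales_amount",
    "uv": "uv",
}

def build_metrics_query(model_output: str, table: str = "dm.metrics_daily") -> str:
    metrics = json.loads(model_output)["指标"]
    columns = [f"SUM({METRIC_COLUMNS[m]}) AS `{m}`" for m in metrics if m in METRIC_COLUMNS]
    return f"SELECT dt, {', '.join(columns)} FROM {table} GROUP BY dt"

print(build_metrics_query('{"指标": ["销售额(退拒后)", "uv"]}'))
```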
We have also tested how reliably different models produce this JSON and constructed many samples for correctness assessment. On the MI300X, we also evaluated the performance of Qwen2-72B-Instruct.
We use an embedding model ('bge-large-zh') and the Milvus 2 vector database for storing and matching the various dimensions and metrics, as well as for storing the various analytical methods. The entire process is quite complex, and since this article focuses on inference on the AMD MI300X, it is not described in detail.
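Only to illustrate the matching step, here is a minimal sketch assuming FlagEmbedding for the bge-large-zh model and pymilvus for Milvus 2; the collection name and output field are hypothetical, and the production pipeline involves considerably more logic.
```python
# Minimal sketch: match a user phrase against metric names stored in Milvus.
from FlagEmbedding import FlagModel
from pymilvus import MilvusClient

embedder = FlagModel("BAAI/bge-large-zh-v1.5")
client = MilvusClient(uri="http://localhost:19530")

def match_metric(phrase: str, top_k: int = 3):
    vector = embedder.encode([phrase])[0]
    hits = client.search(
        collection_name="metric_embeddings",  # hypothetical collection of metric names
        data=[vector.tolist()],
        limit=top_k,
        output_fields=["metric_name"],
    )
    return [hit["entity"]["metric_name"] for hit in hits[0]]

print(match_metric("销售额"))
```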
8. Inference model using llama.cpp and embedding on CPU (AMD EPYC 9654)
In my complete scenario, I also used the CPU to perform some model inference. For models around 13B, the CPU is already fully capable of meeting the need. When GPUs are in short supply, the AMD EPYC 9654, with its 96 cores, strong performance, and high memory bandwidth, makes the CPU a very good complementary alternative.
In my test case from last year, the AMD EPYC 9654 with 128 threads improved 70B model performance from 3.77 to 4.01 tokens per second.
Due to its good performance, I use the EPYC 9654 for inference in many scenarios. In the scenario covered by this article, the EPYC 9654 was ultimately used to run the Qwen2-57B-A14B-Instruct model to answer questions in the knowledge base agent. Inference on such a powerful CPU delivers performance that is good enough for production use.
Running qwen2-57b-a14b-instruct-q8_0.gguf with 128 threads, my test case achieves 17.51 tokens per second (57.48 ms per token).
Performance test case:
./llama-cli -m /data/models/qwen2-57b-a14b-instruct-q8_0.gguf \
-n 512 -co -i \
--in-prefix "<|im_start|>user\n" \
--in-suffix "<|im_end|>\n<|im_start|>assistant\n" \
-ngl 28 -fa --temp 0.8 -p "唯品会是一家什么样的公司?"
CPU usage with 128 threads:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
73268 apps 20 0 58.3g 56.9g 56.8g R 12800 7.5 22:14.42 llama-cli
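For serving the knowledge-base agent rather than interactive testing with llama-cli, llama.cpp also ships an HTTP server binary that exposes an OpenAI-compatible API. A sketch of serving the same GGUF model (the port, context size, and thread count here are arbitrary choices):
```bash
# Serve the GGUF model over an OpenAI-compatible HTTP API with llama.cpp's server binary.
./llama-server -m /data/models/qwen2-57b-a14b-instruct-q8_0.gguf \
  -t 128 -c 4096 --host 0.0.0.0 --port 8080
```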
The agent uses a BGE model for knowledge embedding and Milvus 2 for vector storage and retrieval. Both work exceptionally well on the EPYC 9654, delivering very high performance.
9. Deploy Model to Production
We deployed the fine-tuned model online, providing a web interface for users that includes Text-to-SQL and intelligent agents. The generated SQL statements are checked by a syntax tool to ensure they are correct Spark SQL. Below are some screenshots of the system.
Text-to-SQL (ChatBI) Input
Query the daily number of user visits, ensuring to exclude duplicates in the calculation.
SQL Output
SELECT dt, COUNT(DISTINCT user_id) AS user_visits FROM vipdws.dws_log_pageview_merchandise_ds GROUP BY dt
The UI has an "执行" (Execute) button, which allows running the SQL directly and returns the correct results.
Intelligent Agent Input
怎么解析JSON数组 (How do I parse a JSON array?)
Agent Output
SELECT from_json(json_array, 'array<struct<name:string, age:int>>') AS parsed_array FROM json_data;
10. Overall System Architecture
This is my overall system architecture, which includes the underlying hardware (GPUs and CPUs), acceleration libraries, AI frameworks, the model layer, and the application layer. Originally, the entire architecture was based on NVIDIA's ecosystem. In this project, it was completely migrated to AMD's ecosystem and achieved the expected results. Some work is still ongoing; for example, the metrics query (Text2API) is based on a 70B model and will be gradually verified and migrated later.
On aac.amd.com, using multiple MI210 GPUs, I have essentially completed all the work I need. Although some of the supported optimization techniques and methods are not yet as comprehensive as NVIDIA's solutions, I believe that with time, the gradual improvement of ROCm, and the growth of the AMD AI ecosystem, AMD's products will become more and more competitive in AI scenarios.