
Running LLM on AMD NPU Hardware

I will port my LLM-based Japanese-English machine translation model to AMD's new RyzenAI enabled PC (with NPU).

Advanced · Work in progress · 5,316 views

Things used in this project

An AMD Ryzen AI (NPU-enabled) PC

Story


Code

llama-translate-amd-npu translation sample

Python
This is sample code that uses llama-translate-amd-npu for translation. Download the model checkpoint from the link below.
https://huggingface.co/dahara1/llama-translate-amd-npu
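If you prefer to fetch the checkpoint from a script instead of the browser, huggingface_hub's snapshot_download can pull the whole repository into a local folder. A minimal sketch; the local_dir name is an assumption chosen to match the paths used in the sample below:

# Optional helper: download the model repository with huggingface_hub
# (assumes `pip install huggingface_hub`; local_dir matches the paths below).
from huggingface_hub import snapshot_download

snapshot_download(repo_id="dahara1/llama-translate-amd-npu",
                  local_dir="llama-translate-amd-npu")
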
import torch
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging


def translation(instruction, input):
    system =  """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a highly skilled professional translator. You are a native speaker of English, Japanese, French and Mandarin. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.<|eot_id|><|start_header_id|>user<|end_header_id|>"""

    prompt = f"""{system}
### Instruction:
{instruction}

### Input:
{input}

### Response:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

    tokenized_input = tokenizer(prompt, return_tensors="pt",
        padding=True, max_length=1600, truncation=True)

    terminators = [
        tokenizer.eos_token_id,
    ]

    outputs = model.generate(tokenized_input['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=tokenized_input['attention_mask'],
            do_sample=True,
            temperature=0.3,
            top_p=0.5)
    response = outputs[0][tokenized_input['input_ids'].shape[-1]:]
    response_message = tokenizer.decode(response, skip_special_tokens=True)
    return response_message


if __name__ == "__main__":


  transformers.logging.set_verbosity_error()
  logging.disable(logging.CRITICAL)

  set_seed(123)
  p = psutil.Process()
  p.cpu_affinity([0, 1, 2, 3])
  torch.set_num_threads(4)

  tokenizer = AutoTokenizer.from_pretrained("llama3.1-8b_translate-amd-npu")
  tokenizer.pad_token_id = tokenizer.add_special_tokens({'pad_token': '<|finetune_right_pad_id|>'})
  ckpt = r"llama-translate-amd-npu\llama3.1_8b_translate_w_bit_4_awq_amd.pt"

  model = torch.load(ckpt)
  model.eval()
  model = model.to(torch.bfloat16)

  # Route each 4-bit quantized linear layer to the NPU ("aie" = AI Engine)
  # and pack its weights for execution there.
  for n, m in model.named_modules():
      if isinstance(m, qlinear.QLinearPerGrp):
          print(f"Preparing weights of layer : {n}")
          m.device = "aie"
          m.quantize_weights()



  print(translation("Translate Japanese to English.", "1月1日は日本の祝日です。その日は日曜日で、5日ぶりに雨が降りました"))
  print(translation("Translate English to Japanese.", "It’s raining cats and dogs."))
  print(translation("Translate French to Japanese.", "Après la pluie, le beau temps"))
  print(translation("Translate Mandarin to Japanese.", "要功夫深,铁杵磨成针"))

view_olympic_llama-translate.py

Python
Fetch updates from the Olympic live-updates page and display them in Japanese, French and Chinese.
Before starting, download the model from https://huggingface.co/dahara1/llama-translate-amd-npu.
import torch
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging

def translate(instruction, input):
    system =  """<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a highly skilled professional translator. You are a native speaker of English, Japanese, French and Mandarin. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.<|eot_id|><|start_header_id|>user<|end_header_id|>"""

    prompt = f"""{system}
### Instruction:
{instruction}

### Input:
{input}

### Response:
<|eot_id|><|start_header_id|>assistant<|end_header_id|>
"""

    tokenized_input = tokenizer(prompt, return_tensors="pt",
        padding=True, max_length=1600, truncation=True)

    terminators = [
        tokenizer.eos_token_id,
    ]

    outputs = model.generate(tokenized_input['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=tokenized_input['attention_mask'],
            do_sample=True,
            temperature=0.3,
            top_p=0.5)
    response = outputs[0][tokenized_input['input_ids'].shape[-1]:]
    response_message = tokenizer.decode(response, skip_special_tokens=True)
    return response_message

set_seed(123)
p = psutil.Process()
p.cpu_affinity([0, 1, 2, 3])
torch.set_num_threads(4)
transformers.logging.set_verbosity_error()
logging.disable(logging.CRITICAL)

tokenizer = AutoTokenizer.from_pretrained("llama-translate-amd-npu")
# add_special_tokens() registers the pad token on the tokenizer; its return
# value is the number of tokens added, not a token id.
tokenizer.add_special_tokens({'pad_token': '<|finetune_right_pad_id|>'})
ckpt = r"llama-translate-amd-npu\llama3.1_8b_translate_w_bit_4_awq_amd.pt"

model = torch.load(ckpt)
model.eval()
model = model.to(torch.bfloat16)

for n, m in model.named_modules():
    if isinstance(m, qlinear.QLinearPerGrp):
        print(f"Preparing weights of layer : {n}")
        m.device = "aie"
        m.quantize_weights()

### End of LLM setup.

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.service import Service as ChromeService
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from webdriver_manager.chrome import ChromeDriverManager
from bs4 import BeautifulSoup
import time

# Setup Selenium with Chrome driver
options = webdriver.ChromeOptions()
# Comment out the headless option to see the browser window
#options.add_argument('--headless')
options.add_argument('--no-sandbox')
options.add_argument('--disable-dev-shm-usage')
options.add_argument('--disable-blink-features=AutomationControlled')
options.add_argument('user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36')
options.add_argument('--lang=en')

driver = webdriver.Chrome(service=ChromeService(ChromeDriverManager().install()), options=options)

# URL of the live updates page
url = "https://olympics.com/en/paris-2024/live-updates"

# Function to fetch and display the latest news
def fetch_latest_news():
    # Open the page
    driver.get(url)
    
    # Wait for the page to load completely
    try:
        WebDriverWait(driver, 10).until(
            EC.presence_of_element_located((By.TAG_NAME, "body"))
        )
    except Exception as e:
        print("Page did not load in time:", e)
        driver.quit()
        exit()
    
    # Get the page source
    html_content = driver.page_source
    
    # Parse the HTML content using BeautifulSoup
    soup = BeautifulSoup(html_content, "html.parser")
    
    # Find all news sections with the specified class
    # (this class string is generated by the site's front-end build and may change over time)
    news_sections = soup.find_all("div", class_="PostItem-styles__PostPart-sc-3a9e76ca-5 fFkcFP d3lb-post__part d3lb-post__part--text")
    
    # Extract and print the content of each news section
    print("Latest News Texts:")
    for section in news_sections:
        news_text = section.get_text(separator="\n", strip=True)
        print(f"news_text: {news_text}")
        print(translate("Translate English to Mandarin.", news_text), flush=True)
        print("-" * 40, flush=True)  # Separator between news items

# Initial page load and cookie acceptance
driver.get(url)

# Wait for the page to load completely
try:
    WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.TAG_NAME, "body"))
    )
except Exception as e:
    print("Page did not load in time:", e)
    driver.quit()
    exit()

# Accept cookies if the banner appears
try:
    accept_cookies_button = driver.find_element(By.ID, "onetrust-accept-btn-handler")
    accept_cookies_button.click()
    print("Successfully clicked the 'Yes, I am happy' button.")
except Exception as e:
    print("Failed to click the 'Yes, I am happy' button:")

# Wait for the page to process the cookie acceptance
time.sleep(3)

# Periodically fetch and display the latest news
while True:
    fetch_latest_news()
    time.sleep(60)  # Wait for 1 minute before fetching the news again
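
The loop above only prints a Mandarin translation of each item. To match the stated goal of Japanese, French and Chinese output, each news item could instead be passed through a small helper like the sketch below (the helper name translate_item is hypothetical; call it from fetch_latest_news() in place of the single translate() call):

# Optional variant (not in the original script): translate each news item into
# all three target languages, reusing the translate() helper defined above.
def translate_item(news_text):
    for target in ("Japanese", "French", "Mandarin"):
        print(translate(f"Translate English to {target}.", news_text), flush=True)
    print("-" * 40, flush=True)  # Separator between news items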

ALMA-Ja-V3-amd-npu translation sample code

Python
This is sample code that uses ALMA-Ja-V3-amd-npu for translation. Download the model checkpoint from the link below.
https://huggingface.co/dahara1/ALMA-Ja-V3-amd-npu
import torch
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging


def translation(instruction, input):
    system =  """You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating."""
    prompt = f"""{system}

### Instruction:
{instruction}

### Input:
{input}

### Response:
"""

    tokenized_input = tokenizer(prompt, return_tensors="pt",
        padding=True, max_length=1600, truncation=True)

    terminators = [
        tokenizer.eos_token_id,
    ]

    outputs = model.generate(tokenized_input['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=tokenized_input['attention_mask'],
            do_sample=True,
            temperature=0.3,
            top_p=0.5)
    response = outputs[0][tokenized_input['input_ids'].shape[-1]:]
    response_message = tokenizer.decode(response, skip_special_tokens=True)
    return response_message


if __name__ == "__main__":

  set_seed(123)
  p = psutil.Process()
  p.cpu_affinity([0, 1, 2, 3])
  torch.set_num_threads(4)
  transformers.logging.set_verbosity_error()
  logging.disable(logging.CRITICAL)

  tokenizer = AutoTokenizer.from_pretrained("ALMA-Ja-V3-amd-npu")
  tokenizer.pad_token = tokenizer.eos_token
  ckpt = r"ALMA-Ja-V3-amd-npu\alma_w_bit_4_awq_fa_amd.pt"

  model = torch.load(ckpt)
  model.eval()
  model = model.to(torch.bfloat16)
 

  for n, m in model.named_modules():
      if isinstance(m, qlinear.QLinearPerGrp):
          print(f"Preparing weights of layer : {n}")
          m.device = "aie"
          m.quantize_weights()


  print(translation("Translate Japanese to English.", "面白きこともなき世を面白く住みなすものは心なりけり"))
  print(translation("Translate English to Japanese.", "Join me, and together we can rule the galaxy as father and son."))

llama3.1-8b-Instruct-amd-npu sample code

Python
This is sample code that runs llama3.1-8b-Instruct-amd-npu on the NPU. Download the model checkpoint from the link below.
https://huggingface.co/dahara1/llama3.1-8b-Instruct-amd-npu
import torch
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging

set_seed(123)
transformers.logging.set_verbosity_error()
logging.disable(logging.CRITICAL)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
]

message_list = [
    "Who are you? ", 
    # Japanese
    "あなたの乗っている船の名前は何ですか?英語ではなく全て日本語だけを使って返事をしてください",
    # Chinese
    "你经历过的最危险的冒险是什么?请用中文回答所有问题,不要用英文。",
    # French
    "À quelle vitesse va votre bateau ? Veuillez répondre uniquement en français et non en anglais.",
    # Korean
    "당신은 그 배의 어디를 좋아합니까? 영어를 사용하지 않고 모두 한국어로 대답하십시오.",
    # German
    "Wie würde Ihr Schiffsname auf Deutsch lauten? Bitte antwortet alle auf Deutsch statt auf Englisch.", 
    # Taiwanese
    "您發現過的最令人驚奇的寶藏是什麼?請僅使用台語和繁體中文回答,不要使用英文。",
]


if __name__ == "__main__":
    p = psutil.Process()
    p.cpu_affinity([0, 1, 2, 3])
    torch.set_num_threads(4)

    tokenizer = AutoTokenizer.from_pretrained("llama3.1-8b-Instruct-amd-npu")
    ckpt = r"llama3.1-8b-Instruct-amd-npu\llama3.1_8b_w_bit_4_awq_amd.pt"
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    model = torch.load(ckpt)
    model.eval()
    model = model.to(torch.bfloat16)

    for n, m in model.named_modules():
        if isinstance(m, qlinear.QLinearPerGrp):
            print(f"Preparing weights of layer : {n}")
            m.device = "aie"
            m.quantize_weights()

    print("system: " + messages[0]['content'])

    for i in range(len(message_list)):
        messages.append({"role": "user",  "content": message_list[i]})
        print("user: " + message_list[i])

        input = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True
        )

        outputs = model.generate(input['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=input['attention_mask'],
            do_sample=True,
            temperature=0.6,
            top_p=0.9)

        response = outputs[0][input['input_ids'].shape[-1]:]
        response_message = tokenizer.decode(response, skip_special_tokens=True)
        print("assistant: " + response_message)
        messages.append({"role": "system", "content": response_message})

llama3-8b-amd-npu sample code

Python
This is sample code that runs llama3-8b-amd-npu on the NPU. Download the model checkpoint from the link below.
https://huggingface.co/dahara1/llama3-8b-amd-npu
import torch
import time
import os
import psutil
import transformers
from transformers import AutoTokenizer, set_seed
import qlinear
import logging

set_seed(123)
transformers.logging.set_verbosity_error()
logging.disable(logging.CRITICAL)

messages = [
    {"role": "system", "content": "You are a pirate chatbot who always responds in pirate speak!"},
]

message_list = [
    "Who are you? ", 
    # Japanese
    "あなたの乗っている船の名前は何ですか?英語ではなく全て日本語だけを使って返事をしてください",
    # Chinese
    "你经历过的最危险的冒险是什么?请用中文回答所有问题,不要用英文。",
    # French
    "À quelle vitesse va votre bateau ? Veuillez répondre uniquement en français et non en anglais.",
    # Korean
    "당신은 그 배의 어디를 좋아합니까? 영어를 사용하지 않고 모두 한국어로 대답하십시오.",
    # German
    "Wie würde Ihr Schiffsname auf Deutsch lauten? Bitte antwortet alle auf Deutsch statt auf Englisch.", 
    # Taiwanese
    "您發現過的最令人驚奇的寶藏是什麼?請僅使用台語和繁體中文回答,不要使用英文。",
]


if __name__ == "__main__":
    p = psutil.Process()
    p.cpu_affinity([0, 1, 2, 3])
    torch.set_num_threads(4)

    tokenizer = AutoTokenizer.from_pretrained("llama3-8b-amd-npu")
    ckpt = "llama3-8b-amd-npu/pytorch_llama3_8b_w_bit_4_awq_amd.pt"
    terminators = [
        tokenizer.eos_token_id,
        tokenizer.convert_tokens_to_ids("<|eot_id|>")
    ]
    model = torch.load(ckpt)
    model.eval()
    model = model.to(torch.bfloat16)

    for n, m in model.named_modules():
        if isinstance(m, qlinear.QLinearPerGrp):
            print(f"Preparing weights of layer : {n}")
            m.device = "aie"
            m.quantize_weights()

    print("system: " + messages[0]['content'])

    for i in range(len(message_list)):
        messages.append({"role": "user",  "content": message_list[i]})
        print("user: " + message_list[i])

        input = tokenizer.apply_chat_template(
            messages,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True
        )

        outputs = model.generate(input['input_ids'],
            max_new_tokens=600,
            eos_token_id=terminators,
            attention_mask=input['attention_mask'],
            do_sample=True,
            temperature=0.6,
            top_p=0.9)

        response = outputs[0][input['input_ids'].shape[-1]:]
        response_message = tokenizer.decode(response, skip_special_tokens=True)
        print("assistant: " + response_message)
        messages.append({"role": "system", "content": response_message})

Credits

goichi harada