Ryzen AI Software 1.2 was announced on July 29, 2024. This document describes how to set up Ryzen AI Software 1.1.
Ryzen AI Software 1.1 required manual installation steps that can be difficult for people who are not familiar with Windows, so this document provides detailed instructions intended to work for all users.
Ryzen AI Software 1.2 now installs automatically, so the manual setup described in this document is no longer necessary. If you want to use the 1.2 driver, please refer to the official page. (I have added "Appendix: Tips for running Ryzen AI Software 1.2" at the bottom of this page; please refer to it if necessary.)
This page was created for 1.1 and verified with version 1.1. Unfortunately, I was not able to rewrite this page for 1.2 because 1.2 was announced just before the contest deadline (July 31, 2024).
amd/RyzenAI-SW 1.2 was also announced on July 29, 2024. This document uses amd/RyzenAI-SW 1.1. Please note that the folder and script contents have changed completely between the two; this project uses the scripts and files from the 1.1 branch.
About this project
This project focuses on porting my LLM-based Japanese-English machine translation models to AMD's new PC without a GPU, and making sure they run reliably.
As a contestant in the AMD Pervasive AI Developer Contest PC AI, I proposed creating an LLM-based Japanese-English machine translation model that runs seamlessly on a PC without a GPU.
This approach allows for a translation model that is:
- Data-secure
- Fully offline
- Easily customizable
This model will enable the development of various new communication tools that break down language barriers.
In addition to creating task-specific LLMs for translation, we ported general-purpose LLMs and verified their performance.
Project Approach
At the start of the project, I had already created two LLM-based Japanese-English machine translation models with Hugging Face Transformers.
1. Meta's Llama 2 7B-based translation model (ALMA-7B-Ja-V2)
2. Google's Gemma 2 9B-based translation model (C3TR-Adapter)
The project goal would be achieved if ALMA-7B-Ja-V2 could be ported, but I was also hoping to try to port the C3TR-Adapter.
However, LLaMA 3.1 was released while the project was in progress, so in the end I created and ported two models:
- Llama 2-based multilingual (English, Japanese, French, Chinese (Mandarin)) translation model (ALMA-Ja-V3-amd-npu)
- Llama 3.1-based multilingual (English, Japanese, French, Chinese (Mandarin)) translation model (llama-translate-amd-npu)
Hardware information
- UM790 Pro
- CPU AMD Ryzen 9 7940HS Processor
- GPU AMD Radeon 780M
- System Memory 16GB x 2
- Storage 512GB
Very small: 130 mm × 126 mm footprint, weighing 666 g (1,240 g including the AC adapter and cord).
It supports VESA mounting and can be installed behind a display.
Enabling NPU in BIOS
1. First, check whether your computer has a processor with the Ryzen AI engine (NPU).
2. The UM790 Pro has the NPU disabled in the BIOS by default. If left as is, the device cannot be found in the hardware device manager and the installation will fail.
3. To access the BIOS setting screen, hold down the DEL key when turning on the computer.
4. Press Setup.
5. Select "Advanced", then "CPU Configuration"
6. Change "IPU Control" drop down menu to Enable.
7. Select "Save & Exit" from the left menu.
How to download and install Visual Studio 2019
- Visual Studio 2019 is a bit old and needs to be downloaded from Visual Studio Older Downloads.
- If you install a newer version of Visual Studio instead, the setup script's version check may fail.
Problems with amd_install_kipudrv.bat
- Some users report an error stating the driver is already installed after running this batch file.
- It's better to overwrite the existing driver as it may be outdated.
- Run CMD in administrator mode to execute this successfully.
Problems with xrt_coreutil.dll
- Issues related to the xrt_coreutil.dll file can occur during installation.
- The installation may fail because this DLL file is not found.
- Some have resolved this by manually copying the DLL file to the appropriate directory and adding it to the PATH.
(1)Enable IPU
As mentioned above, enable the IPU in the BIOS menu. IPU is the old name for NPU.
(2)Follow the official installation manual
AMD official installation setup page(1.1)
After enabling the IPU, follow the official installation manual. However, there are some pitfalls, which are explained below.
(2-1)Download the NPU Driver and setup
You need an AMD account, so create one and download ipu_stack_rel_silicon_prod_1.1.zip.
After downloading, unzip the file, launch CMD, and run the batch file. Remember to run CMD as administrator; otherwise you will get a red "access is denied" error and the process will fail.
The following screen is a little different because it is from the Japanese version of Windows 11. However, the steps are the same.
- Type cmd in the bottom center of the screen input field
- Right-click on the Command prompt part of the menu
- Select Run as administrator
Then go to the folder where you unzipped ipu_stack_rel_silicon_prod_1.1.zip
cd ipu_stack_rel_silicon_prod_1.1
then
.\amd_install_kipudrv.bat
When you run .\amd_install_kipudrv.bat, you may be told that the NPU driver is already installed. However, the pre-installed driver is dated 2023/05/15, while the driver in the downloaded folder is dated 2024/02/13, so you should install it anyway.
Then check the driver version in Device Manager:
System devices -> AMD IPU Device
Check that the driver date is 2024/02/13.
(2-2) Install additional software
(2-2-1) Install Visual Studio 2019 Community Edition
Yes, you need Visual Studio 2019 Community Edition (free), which requires a Microsoft account. Go to the Visual Studio Older Downloads page, join the Dev Essentials program, and download the 2019 version.
I chose the following:
- "Python development"
- "Desktop development with C++"
(2-2-2) Install CMake
Go to the CMake download page and download the installer.
I used Windows x64 Installer(cmake-3.29.0-windows-x86_64.msi).
And choose "Add Cmake to the system PATH for the current user"
(2-2-3)install Miniconda
Go to miniconda download page and download it.
I used Miniconda3-latest-Windows-x86_64.msi.
I left the settings at the installation default and did not add the PATH. However, there were some failures during the setup afterwards, so I ended up adding the PATH manually.
Now let's add PATH to tell CMD where Conda is located.
However, if you want to use a different python installation, it is better to skip this step.
input env and choose "Edit System Environment Variables".
push "Environment Variables"
Select "Path" and push "Edit".
Push "New" and Enter the directory where you installed miniconda, and + \Script, + \lib\bin
Above is my case. "dev1" is my account. so you need to replace it with your account.
C:\User\<your_account_name>\miniconda3\
C:\User\<your_account_name>\miniconda3\Scripts
C:\User\<your_account_name>\miniconda3\lib\bin
I did not install Python because miniconda includes python 3.12
(2-3)Install the Ryzen AI Software(For 1.1)
Download the Ryzen AI Software installation package (ryzen-ai-sw-1.1.zip) and extract it.
The Ryzen AI Software directory will be referenced many times later during inference, so store it in a fixed location, such as directly under the C drive, rather than in a temporary location.
Start CMD as administrator as before.
cd \ryzen-ai-sw-1.1
.\install.bat
.....
Setting RYZEN_AI_INSTALLER env variable ...
Setting XLNX_VART_FIRMWARE env variable ...
Created conda env: ryzenai-1.1-20240603-234049
In my case, the name of the Conda virtual environment was "ryzenai-1.1-20240603-234049".
Yours will probably be different, because the name includes the installation timestamp. Please make a note of it.
If you have forgotten it, you can check it with the "conda env list" command:
C:\Users\dev1>conda env list
# conda environments:
#
base C:\Users\dev1\miniconda3
ryzenai-1.1-20240603-234049 C:\Users\dev1\miniconda3\envs\ryzenai-1.1-20240603-234049
ryzenai-transformers C:\Users\dev1\miniconda3\envs\ryzenai-transformers
Now, after installing Conda, initialize it:
conda init
Then quit cmd and launch it again.
Run the following command.
C:\Users\user\Downloads\ryzen-ai-sw-1.1\ryzen-ai-sw-1.1>conda activate ryzenai-1.1-20240603-234049
Remember, in my case it was ryzenai-1.1-20240603-234049; replace it with your own virtual environment name.
Then let's test it:
cd quicktest
python quicktest.py
...
[Vitis AI EP] No. of Subgraphs : CPU 1 IPU 1 Actually running on IPU 1
2024-06-03 23:54:43.9186045 [W:onnxruntime:, session_state.cc:1169 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Some nodes were not assigned to the preferred execution providers which may or may not have an negative impact on performance. e.g. ORT explicitly assigns shape related ops to CPU to improve perf.
2024-06-03 23:54:43.9248225 [W:onnxruntime:, session_state.cc:1171 onnxruntime::VerifyEachNodeIsAssignedToAnEp] Rerunning with verbose output on a non-minimal build will show node assignments.
Test Passed
2024-06-03 23:54:44.7501705 [W:onnxruntime:Default, vitisai_execution_provider.cc:74 onnxruntime::VitisAIExecutionProvider::~VitisAIExecutionProvider] Releasing the FlexML EP pointer in VitisAI EP
If you see "Test Passed", everything is working properly. Congratulations!
After reading the official documentation about Runtime Setup (1.1) (you can choose between the Throughput Profile and the Latency Profile for the NPU), you can proceed with your project using this environment.
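As a rough sketch of what that runtime setup involves, you can also set these variables from inside a Python script before the first inference session is created (setting them in CMD with "set", as the official docs show, works just as well). The paths below are assumptions based on my own 1.1 install; the throughput profile points at a different .xclbin file, so check the official Runtime Setup page for the exact name.
import os

# Hypothetical install location -- adjust to where you extracted ryzen-ai-sw-1.1.
RYZEN_AI_SW = r"C:\ryzen-ai-sw-1.1\ryzen-ai-sw-1.1"

# 1x4.xclbin is the firmware used by the samples in this article.
os.environ["XLNX_VART_FIRMWARE"] = os.path.join(
    RYZEN_AI_SW, "voe-4.0-win_amd64", "1x4.xclbin"
)
os.environ["NUM_OF_DPU_RUNNERS"] = "1"
# These must be set before the first inference session is created.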
When you log out and start working again, don't forget to activate the virtual environment with CMD first. If something goes wrong with the environment, you can reset it by creating a new virtual environment.
conda activate ryzenai-1.1-YYYYMMDD-HHMMSS
Once you've completed this, you can proceed with your own project. We recommend referring to the samples in the amd/RyzenAI-SW 1.1 branch on GitHub. Good luck!
If your project is related to LLM, please continue reading below.
Running the Llama 2 AWQ sample
First, you need Git for Windows to download files from GitHub.
So download and install it. I used Git-2.44.0-64-bit.exe.
Basically, proceed as described in https://github.com/amd/RyzenAI-SW/tree/1.1, but there is one big catch.
Some commands are meant to be run in CMD and some in PowerShell, but this is not always clear. The following steps will fail with an error if they are not run in CMD. (In the RyzenAI_quant_tutorial, on the other hand, PowerShell was sometimes required.)
Let's start.
Decide on a working folder and run the following commands there to download the files:
git lfs install
git clone https://github.com/amd/RyzenAI-SW.git
cd RyzenAI-SW
git lfs pull
git lfs fetch --all
Follow the instructions below to carry out the procedure.
https://github.com/amd/RyzenAI-SW/tree/1.1/example/transformers
Step 1: Download repository and create conda environment based on provided yaml file
cd example\transformers
conda env create --file=env.yaml
conda activate ryzenai-transformers
Download the precomputed AWQ results. They are over 32.3 GB.
cd ext
git lfs install
git clone https://huggingface.co/datasets/mit-han-lab/awq-model-zoo awq_cache
Step 2: Setup environment
cd ..
.\setup.bat
Step 3: Build dependencies
pip install ops\cpp --force-reinstall
You need CMD here. If you use PowerShell, you will encounter an error like the one below.
CMake Error at CMakeLists.txt:15 (find_package):
Could not find a package configuration file provided by "XRT" with any of
the following names:
XRTConfig.cmake
xrt-config.cmake
Add the installation prefix of "XRT" to CMAKE_PREFIX_PATH or set "XRT_DIR"
to a directory containing one of the above files. If "XRT" provides a
separate development package or SDK, be sure it has been installed.
We skip Step 4 because our approach uses the PyTorch-based workflow.
Next, follow the instructions below to carry out the procedure.
https://github.com/amd/RyzenAI-SW/tree/1.1/example/transformers/models/llama2/README.md
However, we don't need to do "Prepare Llama2 Weights to use with HF" because a model already converted to HF (Hugging Face) format is available.
If you don't have one, create a Hugging Face account.
Then go to the Llama-2-7b-chat-hf page and apply to Meta for access.
Once approved, you can download the model using your browser, but if you want to use the git command to download it, you will need to log in to the Hugging Face Hub.
In that case, install huggingface-cli for authentication.
After installing it according to the instructions on the official page, log in as follows:
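If you prefer Python over the huggingface-cli login command, the huggingface_hub library exposes the same login flow. A minimal sketch (the token is the access token you create under Settings -> Access Tokens on huggingface.co):
from huggingface_hub import login

# Prompts for your Hugging Face access token and caches it locally,
# so subsequent git/CLI downloads of gated models are authenticated.
login()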
Once logged in, you can download the model and run run_awq.py.
cd RyzenAI-SW\example\transformers\models\llama2
git lfs install
git clone https://huggingface.co/meta-llama/Llama-2-7b-chat-hf
move Llama-2-7b-chat-hf 7B_chat
mkdir llama-2-wts-hf
move 7B_chat llama-2-wts-hf
python run_awq.py --w_bit 4 --task quantize
It took about 7 minutes to complete.
logs
(ryzenai-transformers) C:\work\git\RyzenAI-SW\example\transformers\models\llama2>python run_awq.py --w_bit 4 --task quantize
C:\Users\dev1\miniconda3\envs\ryzenai-transformers\lib\site-packages\transformers\utils\generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
C:\Users\dev1\miniconda3\envs\ryzenai-transformers\lib\site-packages\transformers\utils\generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Namespace(dataset='raw', w_bit=4, awq='load', target='cpu', task='quantize', flash_attention=False, lm_head=False, num_torch_threads=8)
Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████| 2/2 [00:12<00:00, 6.09s/it]
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): Linear(in_features=4096, out_features=4096, bias=False)
(k_proj): Linear(in_features=4096, out_features=4096, bias=False)
(v_proj): Linear(in_features=4096, out_features=4096, bias=False)
(o_proj): Linear(in_features=4096, out_features=4096, bias=False)
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): Linear(in_features=4096, out_features=11008, bias=False)
(up_proj): Linear(in_features=4096, out_features=11008, bias=False)
(down_proj): Linear(in_features=11008, out_features=4096, bias=False)
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
**** Model size: 12916.516MB
Loading pre-computed AWQ results from C:\work\git\RyzenAI-SW\example\transformers\\ext\awq_cache\
Quantization config: {'zero_point': True, 'q_group_size': 128}
real weight quantization...: 100%|█████████████████████████████████████████████████████| 32/32 [05:44<00:00, 10.78s/it]
**** Model size: 6965.766MB
Model transformation: Replacing <class 'qmodule.WQLinear'> layers with <class 'qlinear.QLinearPerGrp'> ...
Model transformation done!: Replaced 224 <class 'qmodule.WQLinear'> layers with <class 'qlinear.QLinearPerGrp'>.
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(k_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(v_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(o_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(up_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(down_proj): ryzenAI.QLinearPerGrp(in_features:11008, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
**** Model size: 564.516MB
Quantized and saved model: pytorch_llama27b_w_bit_4_awq_amd.pt
Then, run the inference.
python run_awq.py --task decode --target aie --w_bit 4
logs
(ryzenai-transformers) C:\work\git\RyzenAI-SW\example\transformers\models\llama2>python run_awq.py --task decode --target aie --w_bit 4
C:\Users\dev1\miniconda3\envs\ryzenai-transformers\lib\site-packages\transformers\utils\generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
C:\Users\dev1\miniconda3\envs\ryzenai-transformers\lib\site-packages\transformers\utils\generic.py:311: UserWarning: torch.utils._pytree._register_pytree_node is deprecated. Please use torch.utils._pytree.register_pytree_node instead.
torch.utils._pytree._register_pytree_node(
Namespace(dataset='raw', w_bit=4, awq='load', target='aie', task='decode', flash_attention=False, lm_head=False, num_torch_threads=8)
Loading from ckpt: pytorch_llama27b_w_bit_4_awq_amd.pt
**** Model size: 564.516MB
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(k_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(v_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(o_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(up_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(down_proj): ryzenAI.QLinearPerGrp(in_features:11008, out_features:4096, bias:torch.Size([1]), device:cpu, w_bit:4 group_size:128 )
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
Preparing weights of layer : model.layers.0.self_attn.q_proj
Preparing weights of layer : model.layers.0.self_attn.k_proj
Preparing weights of layer : model.layers.0.self_attn.v_proj
Preparing weights of layer : model.layers.0.self_attn.o_proj
Preparing weights of layer : model.layers.0.mlp.gate_proj
Preparing weights of layer : model.layers.0.mlp.up_proj
Preparing weights of layer : model.layers.0.mlp.down_proj
Preparing weights of layer : model.layers.1.self_attn.q_proj
Preparing weights of layer : model.layers.1.self_attn.k_proj
Preparing weights of layer : model.layers.1.self_attn.v_proj
Preparing weights of layer : model.layers.1.self_attn.o_proj
Preparing weights of layer : model.layers.1.mlp.gate_proj
Preparing weights of layer : model.layers.1.mlp.up_proj
Preparing weights of layer : model.layers.1.mlp.down_proj
Preparing weights of layer : model.layers.2.self_attn.q_proj
Preparing weights of layer : model.layers.2.self_attn.k_proj
Preparing weights of layer : model.layers.2.self_attn.v_proj
Preparing weights of layer : model.layers.2.self_attn.o_proj
Preparing weights of layer : model.layers.2.mlp.gate_proj
Preparing weights of layer : model.layers.2.mlp.up_proj
Preparing weights of layer : model.layers.2.mlp.down_proj
Preparing weights of layer : model.layers.3.self_attn.q_proj
Preparing weights of layer : model.layers.3.self_attn.k_proj
Preparing weights of layer : model.layers.3.self_attn.v_proj
Preparing weights of layer : model.layers.3.self_attn.o_proj
Preparing weights of layer : model.layers.3.mlp.gate_proj
Preparing weights of layer : model.layers.3.mlp.up_proj
Preparing weights of layer : model.layers.3.mlp.down_proj
Preparing weights of layer : model.layers.4.self_attn.q_proj
Preparing weights of layer : model.layers.4.self_attn.k_proj
Preparing weights of layer : model.layers.4.self_attn.v_proj
Preparing weights of layer : model.layers.4.self_attn.o_proj
Preparing weights of layer : model.layers.4.mlp.gate_proj
Preparing weights of layer : model.layers.4.mlp.up_proj
Preparing weights of layer : model.layers.4.mlp.down_proj
Preparing weights of layer : model.layers.5.self_attn.q_proj
Preparing weights of layer : model.layers.5.self_attn.k_proj
Preparing weights of layer : model.layers.5.self_attn.v_proj
Preparing weights of layer : model.layers.5.self_attn.o_proj
Preparing weights of layer : model.layers.5.mlp.gate_proj
Preparing weights of layer : model.layers.5.mlp.up_proj
Preparing weights of layer : model.layers.5.mlp.down_proj
Preparing weights of layer : model.layers.6.self_attn.q_proj
Preparing weights of layer : model.layers.6.self_attn.k_proj
Preparing weights of layer : model.layers.6.self_attn.v_proj
Preparing weights of layer : model.layers.6.self_attn.o_proj
Preparing weights of layer : model.layers.6.mlp.gate_proj
Preparing weights of layer : model.layers.6.mlp.up_proj
Preparing weights of layer : model.layers.6.mlp.down_proj
Preparing weights of layer : model.layers.7.self_attn.q_proj
Preparing weights of layer : model.layers.7.self_attn.k_proj
Preparing weights of layer : model.layers.7.self_attn.v_proj
Preparing weights of layer : model.layers.7.self_attn.o_proj
Preparing weights of layer : model.layers.7.mlp.gate_proj
Preparing weights of layer : model.layers.7.mlp.up_proj
Preparing weights of layer : model.layers.7.mlp.down_proj
Preparing weights of layer : model.layers.8.self_attn.q_proj
Preparing weights of layer : model.layers.8.self_attn.k_proj
Preparing weights of layer : model.layers.8.self_attn.v_proj
Preparing weights of layer : model.layers.8.self_attn.o_proj
Preparing weights of layer : model.layers.8.mlp.gate_proj
Preparing weights of layer : model.layers.8.mlp.up_proj
Preparing weights of layer : model.layers.8.mlp.down_proj
Preparing weights of layer : model.layers.9.self_attn.q_proj
Preparing weights of layer : model.layers.9.self_attn.k_proj
Preparing weights of layer : model.layers.9.self_attn.v_proj
Preparing weights of layer : model.layers.9.self_attn.o_proj
Preparing weights of layer : model.layers.9.mlp.gate_proj
Preparing weights of layer : model.layers.9.mlp.up_proj
Preparing weights of layer : model.layers.9.mlp.down_proj
Preparing weights of layer : model.layers.10.self_attn.q_proj
Preparing weights of layer : model.layers.10.self_attn.k_proj
Preparing weights of layer : model.layers.10.self_attn.v_proj
Preparing weights of layer : model.layers.10.self_attn.o_proj
Preparing weights of layer : model.layers.10.mlp.gate_proj
Preparing weights of layer : model.layers.10.mlp.up_proj
Preparing weights of layer : model.layers.10.mlp.down_proj
Preparing weights of layer : model.layers.11.self_attn.q_proj
Preparing weights of layer : model.layers.11.self_attn.k_proj
Preparing weights of layer : model.layers.11.self_attn.v_proj
Preparing weights of layer : model.layers.11.self_attn.o_proj
Preparing weights of layer : model.layers.11.mlp.gate_proj
Preparing weights of layer : model.layers.11.mlp.up_proj
Preparing weights of layer : model.layers.11.mlp.down_proj
Preparing weights of layer : model.layers.12.self_attn.q_proj
Preparing weights of layer : model.layers.12.self_attn.k_proj
Preparing weights of layer : model.layers.12.self_attn.v_proj
Preparing weights of layer : model.layers.12.self_attn.o_proj
Preparing weights of layer : model.layers.12.mlp.gate_proj
Preparing weights of layer : model.layers.12.mlp.up_proj
Preparing weights of layer : model.layers.12.mlp.down_proj
Preparing weights of layer : model.layers.13.self_attn.q_proj
Preparing weights of layer : model.layers.13.self_attn.k_proj
Preparing weights of layer : model.layers.13.self_attn.v_proj
Preparing weights of layer : model.layers.13.self_attn.o_proj
Preparing weights of layer : model.layers.13.mlp.gate_proj
Preparing weights of layer : model.layers.13.mlp.up_proj
Preparing weights of layer : model.layers.13.mlp.down_proj
Preparing weights of layer : model.layers.14.self_attn.q_proj
Preparing weights of layer : model.layers.14.self_attn.k_proj
Preparing weights of layer : model.layers.14.self_attn.v_proj
Preparing weights of layer : model.layers.14.self_attn.o_proj
Preparing weights of layer : model.layers.14.mlp.gate_proj
Preparing weights of layer : model.layers.14.mlp.up_proj
Preparing weights of layer : model.layers.14.mlp.down_proj
Preparing weights of layer : model.layers.15.self_attn.q_proj
Preparing weights of layer : model.layers.15.self_attn.k_proj
Preparing weights of layer : model.layers.15.self_attn.v_proj
Preparing weights of layer : model.layers.15.self_attn.o_proj
Preparing weights of layer : model.layers.15.mlp.gate_proj
Preparing weights of layer : model.layers.15.mlp.up_proj
Preparing weights of layer : model.layers.15.mlp.down_proj
Preparing weights of layer : model.layers.16.self_attn.q_proj
Preparing weights of layer : model.layers.16.self_attn.k_proj
Preparing weights of layer : model.layers.16.self_attn.v_proj
Preparing weights of layer : model.layers.16.self_attn.o_proj
Preparing weights of layer : model.layers.16.mlp.gate_proj
Preparing weights of layer : model.layers.16.mlp.up_proj
Preparing weights of layer : model.layers.16.mlp.down_proj
Preparing weights of layer : model.layers.17.self_attn.q_proj
Preparing weights of layer : model.layers.17.self_attn.k_proj
Preparing weights of layer : model.layers.17.self_attn.v_proj
Preparing weights of layer : model.layers.17.self_attn.o_proj
Preparing weights of layer : model.layers.17.mlp.gate_proj
Preparing weights of layer : model.layers.17.mlp.up_proj
Preparing weights of layer : model.layers.17.mlp.down_proj
Preparing weights of layer : model.layers.18.self_attn.q_proj
Preparing weights of layer : model.layers.18.self_attn.k_proj
Preparing weights of layer : model.layers.18.self_attn.v_proj
Preparing weights of layer : model.layers.18.self_attn.o_proj
Preparing weights of layer : model.layers.18.mlp.gate_proj
Preparing weights of layer : model.layers.18.mlp.up_proj
Preparing weights of layer : model.layers.18.mlp.down_proj
Preparing weights of layer : model.layers.19.self_attn.q_proj
Preparing weights of layer : model.layers.19.self_attn.k_proj
Preparing weights of layer : model.layers.19.self_attn.v_proj
Preparing weights of layer : model.layers.19.self_attn.o_proj
Preparing weights of layer : model.layers.19.mlp.gate_proj
Preparing weights of layer : model.layers.19.mlp.up_proj
Preparing weights of layer : model.layers.19.mlp.down_proj
Preparing weights of layer : model.layers.20.self_attn.q_proj
Preparing weights of layer : model.layers.20.self_attn.k_proj
Preparing weights of layer : model.layers.20.self_attn.v_proj
Preparing weights of layer : model.layers.20.self_attn.o_proj
Preparing weights of layer : model.layers.20.mlp.gate_proj
Preparing weights of layer : model.layers.20.mlp.up_proj
Preparing weights of layer : model.layers.20.mlp.down_proj
Preparing weights of layer : model.layers.21.self_attn.q_proj
Preparing weights of layer : model.layers.21.self_attn.k_proj
Preparing weights of layer : model.layers.21.self_attn.v_proj
Preparing weights of layer : model.layers.21.self_attn.o_proj
Preparing weights of layer : model.layers.21.mlp.gate_proj
Preparing weights of layer : model.layers.21.mlp.up_proj
Preparing weights of layer : model.layers.21.mlp.down_proj
Preparing weights of layer : model.layers.22.self_attn.q_proj
Preparing weights of layer : model.layers.22.self_attn.k_proj
Preparing weights of layer : model.layers.22.self_attn.v_proj
Preparing weights of layer : model.layers.22.self_attn.o_proj
Preparing weights of layer : model.layers.22.mlp.gate_proj
Preparing weights of layer : model.layers.22.mlp.up_proj
Preparing weights of layer : model.layers.22.mlp.down_proj
Preparing weights of layer : model.layers.23.self_attn.q_proj
Preparing weights of layer : model.layers.23.self_attn.k_proj
Preparing weights of layer : model.layers.23.self_attn.v_proj
Preparing weights of layer : model.layers.23.self_attn.o_proj
Preparing weights of layer : model.layers.23.mlp.gate_proj
Preparing weights of layer : model.layers.23.mlp.up_proj
Preparing weights of layer : model.layers.23.mlp.down_proj
Preparing weights of layer : model.layers.24.self_attn.q_proj
Preparing weights of layer : model.layers.24.self_attn.k_proj
Preparing weights of layer : model.layers.24.self_attn.v_proj
Preparing weights of layer : model.layers.24.self_attn.o_proj
Preparing weights of layer : model.layers.24.mlp.gate_proj
Preparing weights of layer : model.layers.24.mlp.up_proj
Preparing weights of layer : model.layers.24.mlp.down_proj
Preparing weights of layer : model.layers.25.self_attn.q_proj
Preparing weights of layer : model.layers.25.self_attn.k_proj
Preparing weights of layer : model.layers.25.self_attn.v_proj
Preparing weights of layer : model.layers.25.self_attn.o_proj
Preparing weights of layer : model.layers.25.mlp.gate_proj
Preparing weights of layer : model.layers.25.mlp.up_proj
Preparing weights of layer : model.layers.25.mlp.down_proj
Preparing weights of layer : model.layers.26.self_attn.q_proj
Preparing weights of layer : model.layers.26.self_attn.k_proj
Preparing weights of layer : model.layers.26.self_attn.v_proj
Preparing weights of layer : model.layers.26.self_attn.o_proj
Preparing weights of layer : model.layers.26.mlp.gate_proj
Preparing weights of layer : model.layers.26.mlp.up_proj
Preparing weights of layer : model.layers.26.mlp.down_proj
Preparing weights of layer : model.layers.27.self_attn.q_proj
Preparing weights of layer : model.layers.27.self_attn.k_proj
Preparing weights of layer : model.layers.27.self_attn.v_proj
Preparing weights of layer : model.layers.27.self_attn.o_proj
Preparing weights of layer : model.layers.27.mlp.gate_proj
Preparing weights of layer : model.layers.27.mlp.up_proj
Preparing weights of layer : model.layers.27.mlp.down_proj
Preparing weights of layer : model.layers.28.self_attn.q_proj
Preparing weights of layer : model.layers.28.self_attn.k_proj
Preparing weights of layer : model.layers.28.self_attn.v_proj
Preparing weights of layer : model.layers.28.self_attn.o_proj
Preparing weights of layer : model.layers.28.mlp.gate_proj
Preparing weights of layer : model.layers.28.mlp.up_proj
Preparing weights of layer : model.layers.28.mlp.down_proj
Preparing weights of layer : model.layers.29.self_attn.q_proj
Preparing weights of layer : model.layers.29.self_attn.k_proj
Preparing weights of layer : model.layers.29.self_attn.v_proj
Preparing weights of layer : model.layers.29.self_attn.o_proj
Preparing weights of layer : model.layers.29.mlp.gate_proj
Preparing weights of layer : model.layers.29.mlp.up_proj
Preparing weights of layer : model.layers.29.mlp.down_proj
Preparing weights of layer : model.layers.30.self_attn.q_proj
Preparing weights of layer : model.layers.30.self_attn.k_proj
Preparing weights of layer : model.layers.30.self_attn.v_proj
Preparing weights of layer : model.layers.30.self_attn.o_proj
Preparing weights of layer : model.layers.30.mlp.gate_proj
Preparing weights of layer : model.layers.30.mlp.up_proj
Preparing weights of layer : model.layers.30.mlp.down_proj
Preparing weights of layer : model.layers.31.self_attn.q_proj
Preparing weights of layer : model.layers.31.self_attn.k_proj
Preparing weights of layer : model.layers.31.self_attn.v_proj
Preparing weights of layer : model.layers.31.self_attn.o_proj
Preparing weights of layer : model.layers.31.mlp.gate_proj
Preparing weights of layer : model.layers.31.mlp.up_proj
Preparing weights of layer : model.layers.31.mlp.down_proj
LlamaForCausalLM(
(model): LlamaModel(
(embed_tokens): Embedding(32000, 4096)
(layers): ModuleList(
(0-31): 32 x LlamaDecoderLayer(
(self_attn): LlamaAttention(
(q_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(k_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(v_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(o_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(rotary_emb): LlamaRotaryEmbedding()
)
(mlp): LlamaMLP(
(gate_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:None, device:aie, w_bit:4 group_size:128 )
(up_proj): ryzenAI.QLinearPerGrp(in_features:4096, out_features:11008, bias:None, device:aie, w_bit:4 group_size:128 )
(down_proj): ryzenAI.QLinearPerGrp(in_features:11008, out_features:4096, bias:None, device:aie, w_bit:4 group_size:128 )
(act_fn): SiLUActivation()
)
(input_layernorm): LlamaRMSNorm()
(post_attention_layernorm): LlamaRMSNorm()
)
)
(norm): LlamaRMSNorm()
)
(lm_head): Linear(in_features=4096, out_features=32000, bias=False)
)
**** Model size: 564.512MB
Warming up ...
Warm up DONE!!
****************************************
prompt: What is the meaning of life?
response: What is the meaning of life?
The question of the meaning of life is a philosoph
****************************************
prompt: Tell me something you don't know.
response: Tell me something you don't know.
I don't know if you're
****************************************
prompt: What does Xilinx do?
response: What does Xilinx do?
Xilinx is a leading provider of programm
****************************************
prompt: What is the mass of earth?
response: What is the mass of earth?
The mass of Earth is approximately 5.
****************************************
prompt: What is a poem?
response: What is a poem?
A poem is a piece of writing that uses
****************************************
prompt: What is recursion?
response: What is recursion?
Recursion is a programming technique where a
****************************************
prompt: Tell me a one line joke.
response: Tell me a one line joke.
Here is a one-liner for you
****************************************
prompt: Who is Gilgamesh?
response: Who is Gilgamesh?
Gilgamesh is a legendary king
****************************************
prompt: Tell me something about cryptocurrency.
response: Tell me something about cryptocurrency.
Cryptocurrency is a digital or virtual currency
****************************************
prompt: How did it all begin?
response: How did it all begin?
The concept of the "end times" has
Number of prompts found in log: 10
Example#:1 Prompt-len:8 New-tokens-generated:11 Total-time:3.514s Prefill-phase:549.933ms Time/token:293ms Tokens/sec:3.4
Example#:2 Prompt-len:10 New-tokens-generated:11 Total-time:3.936s Prefill-phase:985.360ms Time/token:294ms Tokens/sec:3.4
Example#:3 Prompt-len:8 New-tokens-generated:11 Total-time:3.487s Prefill-phase:531.689ms Time/token:291ms Tokens/sec:3.4
Example#:4 Prompt-len:8 New-tokens-generated:11 Total-time:3.460s Prefill-phase:542.736ms Time/token:288ms Tokens/sec:3.5
Example#:5 Prompt-len:6 New-tokens-generated:11 Total-time:3.446s Prefill-phase:503.883ms Time/token:294ms Tokens/sec:3.4
Example#:6 Prompt-len:5 New-tokens-generated:11 Total-time:3.435s Prefill-phase:503.075ms Time/token:290ms Tokens/sec:3.4
Example#:7 Prompt-len:9 New-tokens-generated:11 Total-time:3.930s Prefill-phase:960.514ms Time/token:297ms Tokens/sec:3.4
Example#:8 Prompt-len:8 New-tokens-generated:11 Total-time:3.455s Prefill-phase:531.682ms Time/token:289ms Tokens/sec:3.5
Example#:9 Prompt-len:9 New-tokens-generated:11 Total-time:3.893s Prefill-phase:955.288ms Time/token:294ms Tokens/sec:3.4
Example#:10 Prompt-len:7 New-tokens-generated:11 Total-time:3.508s Prefill-phase:518.723ms Time/token:295ms Tokens/sec:3.4
(ryzenai-transformers) C:\work\git\RyzenAI-SW\example\transformers\models\llama2>
OK, Congratulations!
When you log out and start working again, don't forget to activate the virtual environment with CMD first. Then run setup.bat again.
conda activate ryzenai-transformers
\RyzenAI-SW\example\transformers>.\setup.bat
If something goes wrong with the environment, you can reset it using the following command.
conda deactivate
conda env remove --name <your_env_name>
Then create a new virtual environment.
I hope your application changes the world for the better. Good luck!
Other findings
- Using Hugging Face's Optimum library, I was able to convert both Llama 2 7B and Gemma 7B to ONNX format, but accuracy dropped significantly (a minimal export sketch follows this list).
- At the moment, some accuracy loss when converting the model format is unavoidable, so it is important to avoid format conversion as much as possible. There seem to be two options: use a model trained from scratch in ONNX format, or train a model in Hugging Face Transformers format and quantize it directly with AWQ using RyzenAI-SW. This project chose the latter.
- I created AWQ versions of Llama 3 and Llama 3.1 and uploaded them to Hugging Face. They have been confirmed to handle English, Japanese, Chinese (Simplified), French, Korean, German, and Traditional Chinese (as used in Taiwan). If you want to focus on the application layer rather than porting the model, you should use them.
- For Ryzen AI Software 1.2, AMD has a lot of models available, so I recommend checking out the RyzenAI Model Zoo.
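For reference, the Optimum-based ONNX conversion mentioned in the first bullet looked roughly like the sketch below. This is only an illustration (the model ID and output directory are placeholders), not the exact command I used.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "meta-llama/Llama-2-7b-chat-hf"  # placeholder: any causal LM repo

# export=True converts the Hugging Face checkpoint to ONNX on the fly.
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("llama2-7b-onnx")
tokenizer.save_pretrained("llama2-7b-onnx")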
Now that the setup is complete, I'll briefly explain what I did in this project before running my ported model.
Step 1 Model training
The trained models are already available for download on Hugging Face, so you don't need to perform this step. I'll briefly explain what I did and why.
(1)Continual Pre-training
The original Llama series models support multiple languages, but Japanese and Chinese are not officially supported.
To give the model additional knowledge of Japanese and Chinese, I performed what is called continual pre-training. It is generally recommended to continue training until the loss curve flattens out.
(2)Fine-tuning
AI can perform a variety of tasks, but small models that run on PCs perform better when they are tuned for specific tasks. Here too, it is generally recommended to continue training until the evaluation loss curve flattens out.
Example output without the fine-tuned model:
The second line is output in romaji (a notation that uses the Latin alphabet to write Japanese). Part of the first line is omitted.
Example output with the fine-tuned model:
The second line is written entirely in Japanese. The first line is also translated without omissions.
Step 2 AWQ quantization
The models we created are very large, exceeding 10 GB, so we used AWQ (Activation-aware Weight Quantization) via mit-han-lab's llm-awq to reduce their size.
Please refer to the official llm-awq github page for installation and setup.
The work was done in a Linux environment. Note that the program could not run due to insufficient memory unless the GPU had 40 GB or more of VRAM.
Benchmark results of the created model
In this project, the following six models were ported to Ryzen AI PC.
- Llama2 7B: A large-scale language model for natural language processing developed by Meta, released on July 18, 2023. It is an improved version of Llama 1.
- Llama3 8B: The next-generation version of Llama 2, announced by Meta on April 18, 2024. This model features more advanced natural language processing functions and improved performance.
- Llama3.1 8B: The latest model with significant improvements over Llama 3, announced by Meta on July 23, 2024.
- ALMA-Ja-V3: Llama2 7B with additional training to improve translation capabilities.
- llama3-8b_translation: Llama3 with additional training to improve translation capabilities.
- Llama-translate: Llama3.1 with additional training to improve translation capabilities.
Llama3 8B and Llama 3.1 8B are general-purpose models and are thought to be useful outside of this project, so they were made public on HuggingFace.
We also compared perplexity scores between our Llama 2 7B and AMD's official AWQ-quantized Llama 2 7B implementation.
Since a lower perplexity score is better, this confirmed that our Llama 2 version performs better than the official implementation.
Perplexity scores cannot be compared directly between different models, so it is not a problem that Llama 3's score is higher than Llama 2's. It was confirmed that Llama 3.1 has a lower perplexity score than Llama 3, which reflects the performance improvement of the base model.
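Perplexity here is measured the usual way for causal language models: average the per-token negative log-likelihood over a held-out text and exponentiate it. The snippet below is a generic sketch of that calculation (the chunk size and evaluation text are assumptions, not the exact setup behind the table above).
import math
import torch

def perplexity(model, tokenizer, text, max_len=2048):
    # Chunked (non-overlapping) perplexity of a causal LM over one long text.
    ids = tokenizer(text, return_tensors="pt").input_ids[0]
    total_nll, total_tokens = 0.0, 0
    model.eval()
    with torch.no_grad():
        for start in range(0, len(ids) - 1, max_len):
            chunk = ids[start : start + max_len + 1].unsqueeze(0)
            if chunk.size(1) < 2:
                break
            out = model(chunk, labels=chunk)   # HF models shift labels internally
            n = chunk.size(1) - 1              # number of predicted tokens
            total_nll += out.loss.item() * n
            total_tokens += n
    return math.exp(total_nll / total_tokens)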
Benchmarking on real tasks
We compared the performance of the following two models created for the translation task:
- ALMA-Ja-V3
Based on Llama 2 7B, with improved translation between Japanese and English.
- Llama-translate
Based on Llama 3.1 8B, with improved translation between English, Japanese, Chinese, and French.
The benchmark used for comparison is flores200, a multilingual translation test created by Meta. The metrics are based on Unbabel/COMET's xcomet-xl, which is said to be close to human evaluation.
Llama-translate outperforms ALMA-Ja-V3. ALMA-Ja-V3 tends to omit long sentences, but it runs quickly.
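The xcomet-xl scoring itself can be reproduced with Unbabel's comet package. The snippet below is a sketch of that evaluation; the example triplet is a placeholder, and XCOMET-XL is a large, gated checkpoint that requires a Hugging Face login and a GPU with plenty of memory.
from comet import download_model, load_from_checkpoint

# Download and load the metric model (gated on Hugging Face).
ckpt_path = download_model("Unbabel/XCOMET-XL")
metric = load_from_checkpoint(ckpt_path)

data = [
    {
        "src": "これはテストです。",   # source sentence
        "mt": "This is a test.",       # machine translation output
        "ref": "This is a test.",      # reference translation
    }
]
result = metric.predict(data, batch_size=8, gpus=1)
print(result.system_score)  # corpus-level score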
One more benchmark result
We compared the multilingual translation function with Google Translate using mini flores200, which has a reduced amount of data.
There is almost no difference for ja→en, fr→en, and cn→en, so we can say that its performance is almost as good as Google Translate, at least when translating from these languages into English.
But please note that these scores vary depending on the length of the sentence and the text category (formal/informal), so it is important to test them using actual sentences.
Model download and sample scripts
All four models created in this project have been uploaded to Hugging Face. Setup instructions and sample scripts are also included.
- llama3-8b-amd-npu General-purpose model
- llama3.1-8b-Instruct-amd-npu General-purpose model
- ALMA-Ja-V3-amd-npu Models for translation tasks
- llama-translate-amd-npu Models for translation tasks
You can try out these models and incorporate them into your own projects.
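To give a rough idea of what the bundled sample scripts do, here is a sketch based on the run_awq.py decode flow shown earlier. It is not the exact model-card script: it assumes the ryzenai-transformers environment is active (so the qlinear module from RyzenAI-SW is importable), the checkpoint file name is a placeholder, and the method names follow the 1.1 example code as I understand it, so verify them against the sample script in each model card.
import torch
from transformers import AutoTokenizer

import qlinear  # from RyzenAI-SW example/transformers (available after setup.bat)

model_dir = "llama-translate-amd-npu"                    # downloaded repository
ckpt = f"{model_dir}/pytorch_llama_w_bit_4_awq_amd.pt"   # placeholder file name

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = torch.load(ckpt)   # AWQ-quantized model saved by run_awq.py
model.eval()

# Move the quantized linear layers to the NPU ("aie"), as in the decode logs above.
for name, module in model.named_modules():
    if isinstance(module, qlinear.QLinearPerGrp):
        module.device = "aie"
        module.quantize_weights()  # prints "Preparing weights of layer : ..."

prompt = "Translate Japanese to English: これはテストです。"  # placeholder prompt format
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))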
Sample application
Let's create a real application using these models.
The Olympics are underway in France.
Wouldn't it be great if you could get real-time Olympic live updates from a website and have them delivered in your own language?
Let's get ready. Please note that this is a 1.1 document, so file locations etc. are different in 1.2.
(1)Start the conda environment and run the setup.bat
conda activate ryzenai-transformers
<your_install_path>\RyzenAI-SW\example\transformers\setup.bat
(2)Install the necessary libraries
pip install selenium
pip install webdriver_manager
pip install -U "huggingface_hub[cli]"
pip install transformers==4.43.3
# Updating the Transformers library will cause the LLama 2 sample to stop working.
# If you want to run LLama 2 again, revert to pip install transformers==4.34.0.
(3)download model
huggingface-cli download dahara1/llama-translate-amd-npu --revision main --local-dir llama-translate-amd-npu
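If you would rather download from Python than via huggingface-cli, snapshot_download does the same thing (the arguments mirror the command above):
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="dahara1/llama-translate-amd-npu",
    revision="main",
    local_dir="llama-translate-amd-npu",
)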
(4)Copy modules and Set runtime
copy <your_ryzen_ai-sw_install_path>\RyzenAI-SW\example\transformers\models\llama2\modeling_llama_amd.py .
# set up Runtime. see https://ryzenai.docs.amd.com/en/latest/runtime_setup.html
set XLNX_VART_FIRMWARE=<your_firmware_install_path>\voe-4.0-win_amd64\1x4.xclbin
set NUM_OF_DPU_RUNNERS=1
(5) Allow our Python program to communicate through the firewall
The following screen is a little different because it is from the Japanese version of Windows 11. However, the steps are the same.
Type "def" in the search field at the bottom center of the screen and click "Windows Defender Firewall".
Next, click "Allow an app or feature through Windows Defender Firewall".
Next, click "Change settings", then "Allow another app...".
Next, click "Browse...".
Next, browse to your conda or Miniconda environment path, select python.exe, and click Open.
Next, click the Add button.
Next, after confirming that Python has been added, click OK to close the window.
This allows Python to communicate over the Internet. If you do not plan to use this in the future, you can remove Python from the list by reversing these steps after the sample script has finished running.
For the script, please refer to the view_olympic_llama-translate.py in the Code section of this page.
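For reference, the page-fetching part of such a script looks roughly like the sketch below. It is only an illustration: the URL is a placeholder, and the real view_olympic_llama-translate.py also feeds the scraped text to the translation model.
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from webdriver_manager.chrome import ChromeDriverManager

# Start a Chrome session; webdriver_manager downloads a matching chromedriver.
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()))
try:
    driver.get("https://example.com/olympics-live")   # placeholder URL
    body_text = driver.find_element(By.TAG_NAME, "body").text
finally:
    driver.quit()

# body_text would then be split into sentences and passed to llama-translate.
print(body_text[:500])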
Below you can see the script in action.
Other interesting uses would be combining it with OCR software, transcription software, free games, etc.
Why an on-device model is important
This system offers several advantages, including low power consumption, quiet operation, and low heat. It can operate without API registration, ensuring privacy protection and eliminating concerns about data leaks.
Many recent services claim that user-entered data will be used for AI learning. However, there may be instances where users input text they do not own the rights to or conversation logs they have not agreed to share. Even services that state they will not use data for AI learning can pose risks if the data is leaked by the company, often resulting in unsatisfactory compensation.
Local AI is easy to customize and works in offline environments. As privacy protections are increasingly reconsidered, local AI is likely to see wider adoption.
Another advantage of local single-task AI models is the absence of refusals to perform certain tasks. For example, when I attempted to use ChatGPT to translate news related to terrorism, I received the response, "I can't cooperate with terrorism." This limitation is not a concern with local single-task AI models.
I hope that this model will become a useful AI for everyone.
Limitations
- The llama-translate model does not fully implement RoPE, and its context length is shorter than that of the original Llama 3.1.
- The llama-translate model does not implement Flash Attention.
- The llama-translate model may not be able to fully utilize the performance of the NPU. This may be improved in Version 1.2, as a profiler is now available.
- The execution speed is still slow; it is not yet practical for applications that require real-time translation, such as chat. However, Version 1.2 includes speed improvements such as speculative decoding, and while the hardware I used this time delivers 10 TOPS, a 40 TOPS machine is scheduled for release in the fall, so this problem is expected to improve gradually.
If you want to incorporate these models into your projects, please check the Meta community licenses.
llama-translate-amd-npu is based on Meta-Llama-3.1-8B-Instruct, so please read and follow the LLAMA 3.1 COMMUNITY LICENSE. It allows commercial use but has naming and attribution requirements.
ALMA-Ja-V3-amd-npu is based on Llama-2-7b-hf. So please read and follow the LLAMA 2 COMMUNITY LICENSE.
Also see the Meta Llama 3 Acceptable Use Policy and Responsible Use Guide.
Reference information and Acknowledgements
Here is some information that may be helpful. Thank you.
(1)AMD Pervasive AI Developer Contest
Contest Page.
(2)AMD Pervasive AI Developer Contest PC AI Study Guide
PC AI Contest Details.
(3)AMD Ryzen™ AI Software
Software Development Guide
(4)Ryzen AI: Getting Started Guide
start guide
(5)RyzenAI-SW
example Reference Code
(6)Riallto
exploration framework for the AMD Ryzen AI NPU
(7)amd community support forum
AMD Communities for AI
(8)hackster.io discord
PC AI Discord Channel
(9)Optimum-AMD
Hugging Face libraries enabling performance optimizations for ROCm for AMD GPUs and Ryzen AI for AMD NPU accelerator.
(10)meta-llama
Built with Meta Llama 3 and LLama 2
Appendix: Tips for running Ryzen AI Software 1.2
This is a tip for those who have already set up Ryzen AI Software 1.1 and want to migrate to 1.2. If you are installing 1.2 for the first time, please refer to the 1.1 documentation above as appropriate.
Basically, follow the official instructions given here.
Ryzen AI Software 1.2 Installation Instructions
Major changes
Depending on your CPU model, you need to select the appropriate setup.bat, runtime setup, and so on.
- Phoenix (PHX): AMD Ryzen™ 7940HS, 7840HS, 7640HS, 7840U, 7640U.
- Hawk (HPT): AMD Ryzen™ 8640U, 8640HS, 8645H, 8840U, 8840HS, 8845H, 8945H.
- Strix (STX): AMD Ryzen™ AI 9 HX 370, Ryzen™ AI 9 365
Contest participants are using the 7940HS, so from here on we will assume the CPU is PHX.
Software that needs to be installed
(1)Visual Studio 2022 Community
Download the free Visual Studio 2022 Community edition from here. You may need to create a Microsoft account.
Check "Desktop development with C++". (I also selected "Python development".)
For now, it works fine without deleting Visual Studio 2019 (which was required for Ryzen AI Software 1.1).
(2)cmake version >= 3.26
This should not be a problem, but please refer to the description in 1.1 of this page if necessary.
(3)Anaconda or Miniconda Latest version
In the 1.1 instructions on this page, the PATH was set as a user environment variable. In 1.2, however, it needs to be set as a system environment variable: use the list at the bottom of the Environment Variables dialog instead of the one at the top.
Installing the NPU Driver
Download from the link below. You may need to create an AMD account.
Extract the zip file and run the following command as administrator.
If you don't know how to run as administrator, please refer to the description in 1.1 on this page.
After the installation is complete, check that the date is 2024/07/26 in Device Manager. If it is not 2024/07/26, the setup has failed, so please check the documentation and instructions again.
Install RyzenAI Software
Download RyzenAI Software 1.2 from here.
Running the installer will create a Conda environment called ryzen-ai-1.2.0.
If Conda's path is not set as a system environment variable, you will get an error saying that conda cannot be found.
Ryzen AI Software 1.2 is set up out of the box to run CNN models (i.e., models for image workloads), which is not directly relevant to LLMs, but let's run it to the end to see how it works.
Activate the Conda virtual environment that was created and run the test.
conda activate ryzen-ai-1.2.0
Runtime setup
This must be done every time before starting your application, whenever you use the NPU, regardless of whether you are running a CNN or an LLM.
The official instructions are on this page. As mentioned above, use different ones depending on the CPU model.
As a 7940HS owner, I use this.
set XLNX_VART_FIRMWARE=%RYZEN_AI_INSTALLATION_PATH%/voe-4.0-win_amd64/xclbins/phoenix/1x4.xclbin
set XLNX_TARGET_NAME=AMD_AIE2_Nx4_Overlay
Test it! (It may fail!)
Let's run the following test.
cd %RYZEN_AI_INSTALLATION_PATH%/quicktest
python quicktest.py
If you are using a non-English version of Windows, the following error message may appear and the test may fail.
File "C:\Program Files\RyzenAI\1.2.0\quicktest\quicktest.py", line 16, in get_apu_info
if 'PCI\\VEN_1022&DEV_1502&REV_00' in stdout.decode(): apu_type = 'PHX/HPT'
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x83 in position 14: invalid start byte
I am using the Japanese version of Windows, so I was able to complete the test by making the following changes to the script.
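I will not reproduce the whole script, but the change is around the decode() call shown in the traceback. One possible fix (my own workaround, not an official patch) is to decode the command output leniently or with the local console code page:
# quicktest.py, inside get_apu_info(); the original line was roughly:
#     if 'PCI\\VEN_1022&DEV_1502&REV_00' in stdout.decode(): apu_type = 'PHX/HPT'
# On non-English Windows the command output is not UTF-8, so decode leniently:
if 'PCI\\VEN_1022&DEV_1502&REV_00' in stdout.decode(errors="ignore"):
    apu_type = 'PHX/HPT'
# Alternatively, decode with the console code page, e.g. stdout.decode("cp932")
# on Japanese Windows.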
After some trial and error, if you pass the test, you should see the following screen.
If your purpose is CNN, this completes the setup. For other purposes, such as LLM, you will need to perform a specific setup by referring to amd/RyzenAI-SW on github.
There are a lot of interesting updates in 1.2.
- Support for llama.cpp gguf format (4_0 only, models before 2024 Apr 21)
- Implementation of NPU profiler
- Improved execution speed by speculative decoding
Unfortunately, it is still in the early release stage, so it often does not work as expected, but it will improve over time.