Welcome to Ray, your personal AI assistant designed to make your life easier and more enjoyable. With the ability to learn and adapt to your unique needs and preferences, Ray is more than just a conversational tool - it's a personalized companion that can help you manage your digital life, complete tasks, and even provide a dash of humor and entertainment.
Running locally on your PC, Ray leverages the power of AMD's Ryzen AI software to deliver a range of innovative features, from email management and app downloads to PDF summarization. With Ray, you can enjoy the benefits of AI-powered assistance without the need for an internet connection, making it the perfect solution for those who value privacy and control.
2. Key Project Features
2.1 Chat Interface
Currently, the user interacts with the model through a chat interface built as a webpage but hosted in a minimal CefSharp web browser app written in C#. This means the user does not need to open a separate browser, which saves additional memory. The chat interface runs on startup and is implemented as follows:
- During normal operation mode when the user is interacting with other windows, the app is minimized to display only a time widget at the top right that does not affect interaction with the window below it.
- When the user wants to maximize the app, all they have to do is click its icon in the system tray.
The chat interface has three chat options
- Chat - This allows for normal conversational mode
- PDF - This allows the user to interact with PDFs
- Tools - This allows the user to call tools available to the assistant
2.2. Conversational Memory
For conversational memory, Ray uses a vector store (ChromaDB) to dynamically store user messages as they come in. This is used in conjunction with a conversational buffer window that supplies the last 4 interactions plus the 4 most relevant messages from past interactions. For a considerable speed-up, this is implemented from scratch.
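A minimal sketch of how such a hybrid memory could be wired up with ChromaDB is shown below; the collection name, buffer handling, and context format are illustrative assumptions rather than Ray's actual code.

```python
# Hybrid conversational memory: a rolling buffer of the last 4 exchanges
# plus the 4 most relevant past messages retrieved from ChromaDB.
from collections import deque
import chromadb

client = chromadb.Client()
history = client.create_collection("chat_history")   # uses Chroma's default embedder
buffer = deque(maxlen=4)                              # last 4 user/assistant exchanges
turn_id = 0

def remember(user_msg: str, assistant_msg: str) -> None:
    """Store the finished exchange in both the buffer and the vector store."""
    global turn_id
    buffer.append((user_msg, assistant_msg))
    history.add(
        documents=[f"User: {user_msg}\nAssistant: {assistant_msg}"],
        ids=[f"turn-{turn_id}"],
    )
    turn_id += 1

def build_context(new_prompt: str) -> str:
    """Combine the recent buffer with the 4 most relevant older turns."""
    n = min(4, max(history.count(), 1))
    relevant = history.query(query_texts=[new_prompt], n_results=n)
    recalled = "\n".join(relevant["documents"][0]) if relevant["documents"] else ""
    recent = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in buffer)
    return f"Relevant past messages:\n{recalled}\n\nRecent conversation:\n{recent}"
```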
2.3. RAG
For PDF document interactions, an embedding model (all-mpnet-base-v2) is used to create embeddings, which are then stored in a vector database. The vector database is then queried with the user prompt, returning the 4 most relevant chunks, which are passed to the model along with the prompt. This is also implemented from scratch, allowing for more flexibility.
The user can upload a PDF through the chat interface; the assistant splits the document using LangChain's RecursiveCharacterTextSplitter, then stores the chunks in an in-memory vector database for querying.
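A rough sketch of this from-scratch RAG pipeline is shown below, assuming PyPDFLoader for reading the PDF and Chroma's SentenceTransformer embedding function for all-mpnet-base-v2; exact import paths depend on the installed LangChain version, and the chunk sizes are illustrative.

```python
# RAG over an uploaded PDF: split, embed with all-mpnet-base-v2, store in Chroma,
# then retrieve the 4 most relevant chunks for a prompt.
import chromadb
from chromadb.utils import embedding_functions
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
client = chromadb.Client()                      # in-memory vector store
pdf_store = client.create_collection("pdf_chunks", embedding_function=embedder)

def ingest_pdf(path: str) -> None:
    pages = PyPDFLoader(path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)
    pdf_store.add(
        documents=[c.page_content for c in chunks],
        ids=[f"chunk-{i}" for i in range(len(chunks))],
    )

def retrieve(prompt: str, k: int = 4) -> str:
    hits = pdf_store.query(query_texts=[prompt], n_results=k)
    return "\n\n".join(hits["documents"][0])
```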
2.4. Tool Calling
This is the major advantage of using Meta's Llama 3 model. With a custom implementation of function calling, the assistant is able to perform tasks on the user's computer that many other assistants can't. Currently the assistant only has access to the following functions; a rough dispatch sketch follows the list:
- Getting random jokes from the internet
- Getting the current weather
- Sending emails
- Downloading and installing apps for the user using winget
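The sketch below illustrates one way such prompt-based function calling could be dispatched. The JSON format, function names, and the joke API endpoint are assumptions for illustration, not Ray's exact implementation.

```python
# Prompt-based function calling: the model is asked to emit a JSON tool call,
# which is parsed and dispatched to a Python function.
import json
import subprocess
import requests

def get_joke() -> str:
    # icanhazdadjoke is one public joke API; any joke endpoint would work here.
    r = requests.get("https://icanhazdadjoke.com/", headers={"Accept": "application/json"})
    return r.json()["joke"]

def install_app(package_id: str) -> str:
    # Download and install an application through winget.
    result = subprocess.run(
        ["winget", "install", "--id", package_id, "--silent",
         "--accept-package-agreements", "--accept-source-agreements"],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

TOOLS = {"get_joke": get_joke, "install_app": install_app}

def dispatch(model_output: str) -> str:
    """Expects the model to answer with e.g. {"tool": "install_app", "args": {"package_id": "VideoLAN.VLC"}}."""
    try:
        call = json.loads(model_output)
        return str(TOOLS[call["tool"]](**call.get("args", {})))
    except (json.JSONDecodeError, KeyError, TypeError):
        return model_output   # not a tool call; treat it as a normal reply
```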
As mentioned before, the assistant runs on Meta's Llama-3-8B-Instruct model. The model is quantized using the Ryzen AI LLM quantizer, dropping its size from 16 GB to 7.8 GB for eager mode. An ONNX model is also available for CPU execution at 5.6 GB.
Due to the model's slow inference on the Phoenix NPU, Ray implements CPU execution with the quantized ONNX model while also having the ability to run on the NPU, which will benefit from the much faster Strix NPUs.
3. Running the Model using Ryzen AI SW
Ray offers 2 possibilities:
1. Running on the CPU
2. Running on the NPU
On the 7940HS processor, the model runs faster on the CPU at about 7.8 tokens/sec, while the Phoenix NPU manages about 2.3 tokens/sec. The benchmarks are given later.
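As a hypothetical illustration, the backend switch can be as simple as the following; the two loader functions are placeholders standing in for Ray's real model-loading code.

```python
# Hypothetical backend switch: ONNX on the CPU by default (faster on the Phoenix NPU
# in this project's benchmarks), eager-mode PyTorch on the NPU when requested.
import argparse

def load_onnx_cpu_model():
    # placeholder: here Ray would load the 5.6 GB quantized ONNX model on the CPU
    return "onnx-cpu-model"

def load_eager_npu_model():
    # placeholder: here Ray would load the 7.8 GB eager-mode checkpoint on the NPU
    return "eager-npu-model"

parser = argparse.ArgumentParser(description="Select Ray's inference backend")
parser.add_argument("--device", choices=["cpu", "npu"], default="cpu",
                    help="cpu = ONNX (~7.8 tok/s on a 7940HS), npu = eager mode (~2.3 tok/s on Phoenix)")
args = parser.parse_args()

model = load_onnx_cpu_model() if args.device == "cpu" else load_eager_npu_model()
print(f"Loaded backend: {model}")
```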
There are 3 ways to run models on the NPU:
- Quantizing the model to ONNX
- Eager mode - Quantizing the model to a PyTorch checkpoint and running it with torch
- Using llama.cpp
For this project, I tried all three methods and eventually settled on ONNX for its simplicity and smaller resulting model size for CPU execution, and on the PyTorch (eager mode) model for the NPU.
3.1 Preliminaries - Installing Ryzen AI v1.2
For this project I used the Ryzen™ AI Software to get the 1.2 driver as well as the files needed to run transformer models.
To install the driver, follow this tutorial on the official page.
The 1.2 driver version comes with the advantage of showing NPU utilization on the task manager.
To run a model using the ONNX flow, the following steps are followed. This project was built while the 1.2 version was still in early access, but I will reference the documentation for the official release:
- Download the model from Hugging Face - In this case, I chose to use Meta-Llama-3-8B-Instruct. However, any LLM that can fit into memory should work
- Set up the transformers 1.2 environment
- Use transformers 1.2 to quantize and save the model to ONNX
The steps to set up the ryzenai-transformers environment as well as quantize the model are included in the GitHub repository's README file.
- To run the ONNX model, however, a separate environment is set up to prevent conflicts. This environment can be set up by following this tutorial in the AMD official release documentation.
3.2.2 Quantizing the model to ONNX
This is done by editing the prepare_model.py file to point to my local model, then running the file with the --export and --quantize arguments. Optimization requires more than 64 GB of RAM, which I do not have.
python prepare_model.py --model_name "D:/llama3-8b-instruct" --export --quantize
This process takes a lot of disk space so ensure you have more than 100GB of free storage on disk.
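Before wiring the exported model into the assistant, a quick sanity check with plain onnxruntime on the CPU can confirm that the file loads; the output path below is an assumption, and actual text generation goes through the Ryzen AI run scripts rather than this snippet.

```python
# Sanity check of the exported ONNX file with plain onnxruntime on the CPU.
import onnxruntime as ort

session = ort.InferenceSession(
    "D:/llama3-8b-instruct-onnx/model.onnx",          # assumed export path
    providers=["CPUExecutionProvider"],
)
print("Inputs :", [(i.name, i.shape, i.type) for i in session.get_inputs()])
print("Outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```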
3.3 Running on Eager Mode
To get a model that I could run in eager mode, I followed the documentation in transformers/models/llm, from which I was able to generate a checkpoint used for model inference.
4 Results
4.1 ONNX results
Running the 8B ONNX model on the CPU gives about 7.8 tokens/sec, which matches Ollama. However, due to the way memory is implemented, the time to first token is not as adversely affected as in Ollama, which simply pushes the entire current history.
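For reference, tokens/sec and time-to-first-token can be measured with a small helper like the one below, where stream_tokens is a stand-in for whichever backend is active and is assumed to yield tokens one at a time.

```python
# Measure tokens/sec and time-to-first-token for a streaming generation backend.
import time

def benchmark(stream_tokens, prompt: str) -> dict:
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream_tokens(prompt):        # assumed: yields one token at a time
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": first_token_time,
        "tokens_per_s": n_tokens / total if total > 0 else 0.0,
    }
```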
Running on the NPU requires one to run setup_phx.bat to ensure that the correct .xclbins are loaded. Running on the Phoenix NPU produces 2.3 tokens/sec with a load time about 3 times longer than that of the CPU provider. The model does, however, end up using less CPU.
As seen, the model uses less CPU (about 30% less), though for a longer period of time, and less memory (about 2 GB less) while utilizing the NPU.
However, it has the drawback of outputting nonsense (a bunch of !!!!!!).
In eager mode, the model is quantized using the generated AWQ scales. Here, the CPU and NPU speeds do not differ much; the biggest difference is in load times, with the CPU load time being about 4.5 times faster than that of the NPU.
- However, the CPU uses about 15 GB of memory while the NPU uses only about 8 GB, equal to the model size.
(Screenshots: CPU vs. NPU resource usage)
Transformers 1.2 comes with the ability to compile llama.cpp for the AMD Phoenix NPU. Following the steps outlined, I was able to run Meta-Llama-3-8b-Instruct.Q4_0.gguf on both the CPU and the NPU.
CPU
This ran the best, providing 8.32 tokens/sec with only 4 GB of memory usage and 39% CPU usage.
NPU
On the NPU, the model runs slower at 1.87 tokens/sec with about 9 GB of memory usage. The load time is also about 4 times longer.
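On the CPU side, a comparable llama.cpp run can be reproduced with the llama-cpp-python bindings as sketched below (the NPU build from transformers 1.2 is a separate compiled binary); the model path, context size, and thread count are assumptions.

```python
# Reproducing the CPU llama.cpp run with the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="D:/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf",  # assumed local path
    n_ctx=4096,
    n_threads=8,           # tune to the 7940HS core count
)
out = llm("Tell me a joke about computers.", max_tokens=64)
print(out["choices"][0]["text"])
```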
Generation results
From the demonstration video at the top, we can see that the assistant is able to perform a wide range of tasks with good generation speed on the CPU.
Benchmarking power consumption with the model running on the CPU versus the NPU, I found that NPU execution draws about 20 W less than running fully on the CPU.
Though model inference on the NPU seems a bit slow, it is important to note that the NPU in the 7940HS is the Phoenix NPU, which is a lot slower than the Strix NPU on newer processors. This project highlights the capability and potential of offloading personal assistants to the NPU with much lower power consumption and CPU utilization.
6 Features to be Implemented
This is still an ongoing project that aims to implement the following features:
- More functions giving more control over the user's PC
- Realtime human speech output (technically implemented with pyttsx3, sketched after this list, but the voice is very machine-like and hence sounds odd)
- Realtime speech input to the model
- NPU-GPU offloading to speed up inference
- Inference on Images
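For reference, the current pyttsx3 speech output mentioned above amounts to something like the sketch below; the speaking rate and example phrase are illustrative.

```python
# Offline text-to-speech with pyttsx3; it works without internet but sounds robotic.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 175)            # words per minute
engine.say("Hello, I am Ray. How can I help you today?")
engine.runAndWait()
```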
I would like to thank AMD for their support and for awarding the UM790 Pro used in this project.