Welcome to Ray, your personal AI assistant designed to make your life easier and more enjoyable. With the ability to learn and adapt to your unique needs and preferences, Ray is more than just a conversational tool - it's a personalized companion that can help you manage your digital life, complete tasks, and even provide a dash of humor and entertainment.
Running locally on your PC, Ray leverages the power of AMD's Ryzen AI software to deliver a range of innovative features, from email management and app downloads to PDF summarization. With Ray, you can enjoy the benefits of AI-powered assistance without the need for an internet connection, making it the perfect solution for those who value privacy and control.
2. Key Project Features
2.1 Chat Interface
Currently, the user interacts with the model through a chat interface built as a webpage but hosted in a minimal CefSharp web browser app written in C#. This means the user does not need to open a separate browser, which saves additional memory. The chat interface runs on startup and is implemented as follows:
- During normal operation mode when the user is interacting with other windows, the app is minimized to display only a time widget at the top right that does not affect interaction with the window below it.
- When the user wants to maximize the app, all they have to do is click its icon in the system tray.
The chat interface has three chat options
- Chat - This allows for normal conversational mode
- PDF - This allows the user to interact with PDFs
- Tools - This allows the user to call tools available to the assistant
2.2. Conversational Memory
For conversational memory, Ray uses a vector store (ChromaDB) to dynamically store user messages as they come in. This is used in conjunction with a conversational buffer window that supplies the last 4 interactions plus the 4 most relevant messages from past interactions. For a considerable speed-up, this is implemented from scratch.
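A minimal sketch of how such a hybrid memory could be wired up with ChromaDB is shown below; the collection name, buffer handling, and context format are illustrative assumptions rather than Ray's actual code.

```python
# Hybrid conversational memory: a rolling buffer of the last 4 exchanges
# plus the 4 most relevant past messages retrieved from ChromaDB.
from collections import deque
import chromadb

client = chromadb.Client()
history = client.create_collection("chat_history")   # uses Chroma's default embedder
buffer = deque(maxlen=4)                              # last 4 user/assistant exchanges
turn_id = 0

def remember(user_msg: str, assistant_msg: str) -> None:
    """Store the finished exchange in both the buffer and the vector store."""
    global turn_id
    buffer.append((user_msg, assistant_msg))
    history.add(
        documents=[f"User: {user_msg}\nAssistant: {assistant_msg}"],
        ids=[f"turn-{turn_id}"],
    )
    turn_id += 1

def build_context(new_prompt: str) -> str:
    """Combine the recent buffer with the 4 most relevant older turns."""
    n = min(4, max(history.count(), 1))
    relevant = history.query(query_texts=[new_prompt], n_results=n)
    recalled = "\n".join(relevant["documents"][0]) if relevant["documents"] else ""
    recent = "\n".join(f"User: {u}\nAssistant: {a}" for u, a in buffer)
    return f"Relevant past messages:\n{recalled}\n\nRecent conversation:\n{recent}"
```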
2.3. RAG
For PDF document interactions, an embedding model (all-mpnet-base-v2) is used to create embeddings, which are then stored in a vector database. The vector database is then queried with the user prompt, returning the 4 most relevant chunks, which are passed to the model along with the prompt. This is also implemented from scratch, allowing for more flexibility.
The user can upload a PDF through the chat interface; the assistant splits the document using LangChain's RecursiveCharacterTextSplitter, then stores the chunks in an in-memory vector database for querying.
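A rough sketch of this from-scratch RAG pipeline is shown below, assuming PyPDFLoader for reading the PDF and Chroma's SentenceTransformer embedding function for all-mpnet-base-v2; exact import paths depend on the installed LangChain version, and the chunk sizes are illustrative.

```python
# RAG over an uploaded PDF: split, embed with all-mpnet-base-v2, store in Chroma,
# then retrieve the 4 most relevant chunks for a prompt.
import chromadb
from chromadb.utils import embedding_functions
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

embedder = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="sentence-transformers/all-mpnet-base-v2"
)
client = chromadb.Client()                      # in-memory vector store
pdf_store = client.create_collection("pdf_chunks", embedding_function=embedder)

def ingest_pdf(path: str) -> None:
    pages = PyPDFLoader(path).load()
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100)
    chunks = splitter.split_documents(pages)
    pdf_store.add(
        documents=[c.page_content for c in chunks],
        ids=[f"chunk-{i}" for i in range(len(chunks))],
    )

def retrieve(prompt: str, k: int = 4) -> str:
    hits = pdf_store.query(query_texts=[prompt], n_results=k)
    return "\n\n".join(hits["documents"][0])
```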
2.4. Tool Calling
This is the major advantage of using Meta's Llama 3 model. With a custom implementation of function calling, the assistant is able to perform tasks on the user's computer that many other assistants can't. Currently the assistant only has access to the following functions; a rough dispatch sketch follows the list:
- Getting random jokes from the internet
- Getting the current weather
- Sending emails
- Downloading and installing apps for the user using winget
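The sketch below illustrates one way such prompt-based function calling could be dispatched. The JSON format, function names, and the joke API endpoint are assumptions for illustration, not Ray's exact implementation.

```python
# Prompt-based function calling: the model is asked to emit a JSON tool call,
# which is parsed and dispatched to a Python function.
import json
import subprocess
import requests

def get_joke() -> str:
    # icanhazdadjoke is one public joke API; any joke endpoint would work here.
    r = requests.get("https://icanhazdadjoke.com/", headers={"Accept": "application/json"})
    return r.json()["joke"]

def install_app(package_id: str) -> str:
    # Download and install an application through winget.
    result = subprocess.run(
        ["winget", "install", "--id", package_id, "--silent",
         "--accept-package-agreements", "--accept-source-agreements"],
        capture_output=True, text=True,
    )
    return result.stdout or result.stderr

TOOLS = {"get_joke": get_joke, "install_app": install_app}

def dispatch(model_output: str) -> str:
    """Expects the model to answer with e.g. {"tool": "install_app", "args": {"package_id": "VideoLAN.VLC"}}."""
    try:
        call = json.loads(model_output)
        return str(TOOLS[call["tool"]](**call.get("args", {})))
    except (json.JSONDecodeError, KeyError, TypeError):
        return model_output   # not a tool call; treat it as a normal reply
```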
As mentioned before, the assistant runs on Meta's Llama-3-8B-Instruct model. The model is quantized using the Ryzen AI LLM quantizer, dropping its size from 16 GB to 7.8 GB for eager mode. An ONNX model is also available for CPU execution at 5.6 GB.
Due to the model's slow inference on the Phoenix NPU, Ray implements CPU execution with the quantized ONNX model while also having the ability to run on the NPU, which will benefit from the much faster Strix NPUs.
3. Running the Model using Ryzen AI SW
Ray offers 2 possibilities:
1. Running on the CPU
2. Running on the NPU
On the 7940HS processor, the model runs faster on the CPU at about 7.8 tokens/sec, while the Phoenix NPU manages about 2.3 tokens/sec. The benchmarks are given later.
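As a hypothetical illustration, the backend switch can be as simple as the following; the two loader functions are placeholders standing in for Ray's real model-loading code.

```python
# Hypothetical backend switch: ONNX on the CPU by default (faster on the Phoenix NPU
# in this project's benchmarks), eager-mode PyTorch on the NPU when requested.
import argparse

def load_onnx_cpu_model():
    # placeholder: here Ray would load the 5.6 GB quantized ONNX model on the CPU
    return "onnx-cpu-model"

def load_eager_npu_model():
    # placeholder: here Ray would load the 7.8 GB eager-mode checkpoint on the NPU
    return "eager-npu-model"

parser = argparse.ArgumentParser(description="Select Ray's inference backend")
parser.add_argument("--device", choices=["cpu", "npu"], default="cpu",
                    help="cpu = ONNX (~7.8 tok/s on a 7940HS), npu = eager mode (~2.3 tok/s on Phoenix)")
args = parser.parse_args()

model = load_onnx_cpu_model() if args.device == "cpu" else load_eager_npu_model()
print(f"Loaded backend: {model}")
```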
There are 3 ways to run models on the NPU:
- Quantizing the model to ONNX
- Eager mode - Quantizing the model to a PyTorch checkpoint and running it with torch
- Using llama.cpp
For this project, I tried all three methods and eventually settled on ONNX for its simplicity and smaller resulting model size for CPU execution, and on the PyTorch (eager mode) model for the NPU.
3.1 Preliminaries - Installing Ryzen AI v1.2
For this project I used the Ryzen™ AI Software to get the 1.2 driver as well as the files needed to run transformer models.
To install the driver, follow this tutorial on the official page.
The 1.2 driver version comes with the advantage of showing NPU utilization on the task manager.
To run a model using the ONNX flow, the following steps are followed. This project was built while the 1.2 version was still in early access, but I will reference the documentation for the official release:
- Download the model from Hugging Face - In this case, I chose to use Meta-Llama-3-8B-Instruct. However, any LLM that can fit into memory should work
- Set up the transformers 1.2 environment
- Use transformers 1.2 to quantize and save the model to ONNX
The steps to set up the ryzenai-transformers environment as well as quantize the model are included in the GitHub repository's README file.
- To run the ONNX model, however, a separate environment is set up to prevent conflicts. This environment can be set up by following this tutorial in the AMD official release documentation.
3.2.2 Quantizing the model to ONNX
This is done by editing the prepare_model.py file to point to my local model, then running the file with the --export and --quantize arguments. Optimization requires more than 64 GB of RAM, which I do not have.
python prepare_model.py --model_name "D:/llama3-8b-instruct" --export --quantize
This process takes a lot of disk space so ensure you have more than 100GB of free storage on disk.
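Before wiring the exported model into the assistant, a quick sanity check with plain onnxruntime on the CPU can confirm that the file loads; the output path below is an assumption, and actual text generation goes through the Ryzen AI run scripts rather than this snippet.

```python
# Sanity check of the exported ONNX file with plain onnxruntime on the CPU.
import onnxruntime as ort

session = ort.InferenceSession(
    "D:/llama3-8b-instruct-onnx/model.onnx",          # assumed export path
    providers=["CPUExecutionProvider"],
)
print("Inputs :", [(i.name, i.shape, i.type) for i in session.get_inputs()])
print("Outputs:", [(o.name, o.shape) for o in session.get_outputs()])
```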
3.3 Running on Eager Mode
To get a model that I could run in eager mode, I followed the documentation in transformers/models/llm, from which I was able to generate a checkpoint used for model inference.
4 Results
4.1 ONNX results
Running the 8B ONNX model on the CPU gives about 7.8 tokens/sec, which matches Ollama. However, due to the way memory is implemented, the time to first token is not as adversely affected as in Ollama, which simply pushes the entire current history.
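For reference, tokens/sec and time-to-first-token can be measured with a small helper like the one below, where stream_tokens is a stand-in for whichever backend is active and is assumed to yield tokens one at a time.

```python
# Measure tokens/sec and time-to-first-token for a streaming generation backend.
import time

def benchmark(stream_tokens, prompt: str) -> dict:
    start = time.perf_counter()
    first_token_time = None
    n_tokens = 0
    for _ in stream_tokens(prompt):        # assumed: yields one token at a time
        if first_token_time is None:
            first_token_time = time.perf_counter() - start
        n_tokens += 1
    total = time.perf_counter() - start
    return {
        "time_to_first_token_s": first_token_time,
        "tokens_per_s": n_tokens / total if total > 0 else 0.0,
    }
```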
Running on the NPU requires one to run setup_phx.bat to ensure that the correct .xclbins are loaded. Running on the Phoenix NPU produces 2.3 tokens/sec with a load time about 3 times longer than that of the CPU provider. The model does, however, end up using less CPU.
As seen, the model uses less CPU (about 30% less), though for a longer period of time, and less memory (about 2 GB less) while utilizing the NPU.
However, it has the drawback of outputting nonsense (a bunch of !!!!!!).
In eager mode, the model is quantized using the generated AWQ scales. Here, the CPU and NPU speeds do not differ much; the biggest difference is in load times, with the CPU load time being about 4.5 times faster than that of the NPU.
- However, the CPU uses about 15 GB of memory while the NPU uses only about 8 GB, equal to the model size.
(Screenshots: CPU vs. NPU resource usage)
Transformers 1.2 comes with the ability to compile llama.cpp for the AMD Phoenix NPU. Following the steps outlined, I was able to run Meta-Llama-3-8b-Instruct.Q4_0.gguf on both the CPU and the NPU.
CPU
This ran the best, providing 8.32 tokens/sec with only 4 GB of memory usage and 39% CPU usage.
NPU
On the NPU, the model runs slower at 1.87 tokens/sec with about 9 GB of memory usage. The load time is also about 4 times longer.
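On the CPU side, a comparable llama.cpp run can be reproduced with the llama-cpp-python bindings as sketched below (the NPU build from transformers 1.2 is a separate compiled binary); the model path, context size, and thread count are assumptions.

```python
# Reproducing the CPU llama.cpp run with the llama-cpp-python bindings.
from llama_cpp import Llama

llm = Llama(
    model_path="D:/models/Meta-Llama-3-8B-Instruct.Q4_0.gguf",  # assumed local path
    n_ctx=4096,
    n_threads=8,           # tune to the 7940HS core count
)
out = llm("Tell me a joke about computers.", max_tokens=64)
print(out["choices"][0]["text"])
```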
Generation results
From the demonstration video at the top, we can see that the assistant is able to perform a wide range of tasks with good generation speed on the CPU.
Benchmarking power consumption with the model running on the CPU versus the NPU, I found that NPU execution draws about 20 W less than running fully on the CPU.
Though model inference on the NPU seems a bit slow, it is important to note that the NPU in the 7940HS is the Phoenix NPU, which is a lot slower than the Strix NPU on newer processors. This project highlights the capability and potential of offloading personal assistants to the NPU with much lower power consumption and CPU utilization.
6 Features to be Implemented
This is still an ongoing project that aims to implement the following features:
- More functions giving more control over the user's PC
- Realtime human speech output (technically implemented with pyttsx3, sketched after this list, but the voice is very machine-like and hence sounds odd)
- Realtime speech input to the model
- NPU-GPU offloading to speed up inference
- Inference on Images
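For reference, the current pyttsx3 speech output mentioned above amounts to something like the sketch below; the speaking rate and example phrase are illustrative.

```python
# Offline text-to-speech with pyttsx3; it works without internet but sounds robotic.
import pyttsx3

engine = pyttsx3.init()
engine.setProperty("rate", 175)            # words per minute
engine.say("Hello, I am Ray. How can I help you today?")
engine.runAndWait()
```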
I would like to thank AMD for their support and for awarding the UM790 Pro used in this project.