Imagine having a companion you can pull out from your bag, plug into a power source, ask questions and get help instantly! Who wouldn't love a companion that can help them anytime and anywhere with their tasks? In this project, we have attempted to build a completely offline and accurate chatbot, capable of helping the user with basic tasks even while not being connected to the internet.
Background
With the rapid development of Large Language Models and global research efforts, we are witnessing the rise of personalized AI assistants capable of ingesting huge amounts of information and providing key insights around the clock. Unfortunately, better assistants come at the cost of huge amounts of compute and memory. Size isn't the only problem: hallucinations, information loss, context loss, and privacy issues plague the realm of large language models, making them seem far from perfect. To address these issues, many techniques and dedicated devices have emerged. I've highlighted a couple of relevant ones below.
AMD Ryzen AI:
AMD recently revealed an exciting lineup of AI-enabled processors that ship with a dedicated AI engine powered by AMD's XDNA architecture. Ryzen AI helps offload resource-intensive AI tasks from the CPU and GPU onto the Neural Processing Unit (NPU), freeing up resources for general tasks. By offloading part or all of the AI workload, the AI engine and NPU together provide a significant speedup in inference and a truly offline AI experience. Depending on the processor, these chips can provide anywhere from 10 to 85 TOPS of performance (🤯).
Quantization:
Quantization is the process of converting the floating-point weights and operations of a neural network to lower-precision integer representations, significantly reducing the energy and computational needs of a model without a significant loss in accuracy. This method has become a popular tool for deploying large models at the edge or on smaller compute clusters.
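To make this concrete, here is a minimal, illustrative sketch of symmetric int8 post-training quantization of a single weight tensor. It is a toy example only, not the Vitis AI flow discussed later:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric post-training quantization of an fp32 tensor to int8."""
    scale = np.abs(weights).max() / 127.0          # map the largest magnitude to 127
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximately recover the original fp32 values."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)       # a toy weight matrix
q, scale = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, scale)).max())
```

The int8 tensor needs a quarter of the memory of the fp32 original, at the cost of a small, bounded rounding error per weight.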
RAG:
As fine-tuning costs grew and models kept forgetting previously learned information, researchers started looking at the reliability problem from a new perspective. They came up with a concept called Retrieval Augmented Generation, in which the LLM searches a provided knowledge base and grounds its answers in information it has actually seen. This became a state-of-the-art way to tackle hallucinations and unreliable responses, and paved the way for dependable chat assistants.
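A minimal sketch of the retrieval step is shown below. It assumes the sentence-transformers package and a small embedding model (all-MiniLM-L6-v2), and simply builds a grounded prompt from the most relevant snippets; real RAG stacks like the one we use later handle this behind the scenes:

```python
# Minimal RAG sketch: retrieve the most relevant snippets, then ground the prompt on them.
from sentence_transformers import SentenceTransformer, util

knowledge_base = [
    "Charminar is a 16th-century monument in the old city of Hyderabad.",
    "Hyderabadi biryani is a famous rice dish made with basmati rice and meat.",
    "Golconda Fort offers a panoramic view of Hyderabad at sunset.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
doc_embeddings = embedder.encode(knowledge_base, convert_to_tensor=True)

def build_grounded_prompt(question: str, top_k: int = 2) -> str:
    q_emb = embedder.encode(question, convert_to_tensor=True)
    hits = util.semantic_search(q_emb, doc_embeddings, top_k=top_k)[0]
    context = "\n".join(knowledge_base[h["corpus_id"]] for h in hits)
    return f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"

print(build_grounded_prompt("What should I eat in Hyderabad?"))
```

The grounded prompt is then passed to the LLM, so the answer is tied to the retrieved facts rather than the model's memory alone.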
The Idea
Our original idea was to build an end-to-end voice-based chat assistant capable of executing tasks and providing reliable responses while running entirely offline. To do this, we used the power-packed AI-enabled Mini-PC from Minisforum (Minisforum Venus UM790 Pro), which comes equipped with the powerful AMD Ryzen 9 processor with integrated Ryzen AI technology.
We began exploring available models and, after thorough research, realized that task execution wouldn't fit into the timeframe of the project. We explored fine-tuning LLMs to build a chatbot-type application, but this raised the issue of reliability: fine-tuning can cause catastrophic forgetting and make the model lose what it had previously learnt.
After further exploration, we stumbled upon RAG and began our implementation. We had multiple tools available at our disposal to implement this. After going through multiple blogs and options, we decided to create a RAG based chatbot using AnythingLLM and LM Studio.
LM Studio lets us stand up a local LLM inference server using optimized models from Hugging Face. AnythingLLM provides a chatbot UI and a RAG overlay that lets us build RAG-based chatbots quickly.
This detailed tutorial outlines the steps needed to achieve the above setup: How to enable RAG (Retrieval Augmented Generation)... - AMD Community.
Once the setup is done, we download the Zephyr 7B model in LM Studio to create our backend, as shown above.
It should look like the above image once the server is started.
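To sanity-check the backend, you can also query the LM Studio server directly over its OpenAI-compatible API. The sketch below assumes the default address http://localhost:1234 and that a Zephyr model is already loaded; adjust both if your setup differs:

```python
# Query the local LM Studio inference server (OpenAI-compatible chat completions API).
import requests

payload = {
    "model": "zephyr-7b-beta",          # whichever model you loaded in LM Studio
    "messages": [
        {"role": "system", "content": "You are a helpful offline assistant."},
        {"role": "user", "content": "Suggest three places to visit in Hyderabad."},
    ],
    "temperature": 0.7,
}

response = requests.post("http://localhost:1234/v1/chat/completions", json=payload, timeout=120)
print(response.json()["choices"][0]["message"]["content"])
```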
After you have set up AnythingLLM with the steps from the blog above, you can add all kinds of information to it. For our particular use case, we demonstrate how this approach can help a tourist plan a trip in a remote area without internet access. A tourist (like me) can simply power on their PC and ask the chatbot to recommend food, places to visit, and maybe even a couple of discounted gift shops!
AnythingLLM can scrape websites directly. You simply add the websites you want to use for information and AnythingLLM uses its internal models to embed the content into local vector databases. Each set of documents (or any other type of file!) stays in the workspace you added it to, which also helps keep confidential information separate from other chats. Neither tool sends data off the device, making them a secure choice for chatbots, and both support AMD GPUs for fast responses.
Results
The following images show a couple of examples where the offline assistant helps me know more about Hyderabad.
AnythingLLM also supports Speech-To-Text (STT) and Text-To-Speech (TTS) out of the box. This can therefore be expanded further into a voice-controlled chatbot that runs locally, thanks to the powerful AMD hardware underneath.
Further explorations
What fun is it if we don't use the NPU? I also need to do other tasks apart from chatting with the AI all day!
We tried offloading Zephyr to the NPU by attempting to quantize the original model from Hugging Face. We first attempted basic inference without any optimizations, and the hardware wasn't able to handle it.
This was because the model was too large for the RAM and wasn't optimized for local inference. To address this, we tried to quantize the model after exporting it to ONNX, so that we could run it with ONNX Runtime and the Vitis AI Execution Provider.
We successfully exported the Zephyr model to ONNX but weren't able to quantize it properly. We attempted to quantize it using SmoothQuant and the Vitis AI Quantizer; the code used for this is available in the repository linked at the end.
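For reference, the export step can be reproduced roughly as follows. This is a sketch using Hugging Face Optimum (it needs a lot of RAM and disk for a 7B model), not necessarily the exact script in our repository, and the quantization step that follows is not shown:

```python
# Sketch: export a Hugging Face causal LM to ONNX with Optimum.
from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "HuggingFaceH4/zephyr-7b-beta"

# export=True converts the PyTorch checkpoint to an ONNX graph on the fly
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)
tokenizer = AutoTokenizer.from_pretrained(model_id)

model.save_pretrained("zephyr-7b-onnx")      # writes model.onnx plus config files
tokenizer.save_pretrained("zephyr-7b-onnx")
```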
We also tried running the Zephyr model directly with Vitis AI on ONNX Runtime, but the kernel kept crashing because the model was too large to be loaded into memory.
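For completeness, the attempt looked roughly like the sketch below. The provider options (such as the vaip_config.json path) are assumptions and vary between Ryzen AI SDK versions; this requires the Ryzen AI build of ONNX Runtime:

```python
# Sketch: create an ONNX Runtime session that prefers the Vitis AI Execution Provider.
import onnxruntime as ort

session = ort.InferenceSession(
    "zephyr-7b-onnx/model.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}, {}],
)
print(session.get_providers())   # confirms which provider was actually picked up
```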
LM Studio and AnythingLLM worked out because they use the GGUF format, which offers a range of quantized versions of each model. The Zephyr GGUF model we used was about 5 GB, compared to roughly 25 GB for the original weights. We also tried exporting this GGUF model to ONNX, but that requires dequantization, and it didn't work out because the library couldn't read the GGUF model back into a PyTorch model.
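To illustrate why the GGUF route is so much lighter, here is a minimal sketch that runs a quantized GGUF checkpoint with llama-cpp-python, from the same llama.cpp family of runtimes that handles GGUF in tools like LM Studio. The file name is an assumption and depends on the quantization level you download:

```python
# Sketch: run a quantized GGUF checkpoint locally with llama-cpp-python.
from llama_cpp import Llama

llm = Llama(model_path="zephyr-7b-beta.Q4_K_M.gguf", n_ctx=2048)
out = llm("Q: What is Retrieval Augmented Generation?\nA:", max_tokens=128, stop=["Q:"])
print(out["choices"][0]["text"])
```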
We also came across new examples in the Ryzen AI GitHub repository, but we were already nearing the submission deadline by the time we tried to implement them. We believe the added support and tutorials for Mistral-based models can now help us efficiently quantize our model and get it running on the NPU.
Conclusions
GGUF-based models are hardware friendly and can run efficiently at the edge without a lot of compute. Support for GGUF-based models on the NPU would be a leap forward for fast inference on local hardware.
The world of AI grows more exciting every day! New models and optimization methods appear constantly, and research is bringing us closer to AGI and to models that can run on the edge!
With dedicated AI engines such as those offered by AMD, we will for sure see a future where users and models work hand in hand at peak productivity or maybe just talk like good buddies. (I sure do hope to see J.A.R.V.I.S from Iron Man come to life).
Future Scope
We plan to expand this project in the future. The following are avenues we will explore to run these models efficiently:
- Use the new tutorials to apply SmoothQuant to the exported ONNX version of Zephyr.
- Minimize the model size through compression techniques.
- Attempt to run the model with Vitis AI and expose the server to AnythingLLM to create a RAG application.
- Add task execution support with STT and TTS to make a voice-controlled assistant that can execute tasks at your command.