We believe that improved conversation can help resolve many issues affecting our world today. Most of us can probably relate to a conversation going south because of an argument caused by miscommunication. Disagreements during conversations can escalate tensions and be detrimental to communication. Whether in personal life or at work, arguments may arise from differing views and beliefs, but often the disagreements are caused by miscommunication or misunderstandings. If we could improve our understanding of each other's views, ideas, and opinions, the risk of a conversation turning into an argument would be greatly reduced.
There are already some artificial intelligence (AI) assistants for messaging platforms that can improve your writing and, with that, the conversation. But in-person communication is even more important. With that in mind, we built CAMP: Conversation Assistant for Miscommunication Prevention. CAMP monitors in-person conversations through a microphone and provides responses through a speaker.
CAMP can be used in work environments, where improved understanding results in greater work efficiency, or even in private conversations. It is fully locally hosted on an AMD Ryzen AI PC. The local solution makes CAMP more integrated and more portable. It also resolves the privacy concerns of recording conversations.
Solution
CAMP is an intelligent conversation assistant that monitors a conversation and jumps in with helpful advice when it detects miscommunication, trying to resolve and de-escalate the situation.
We designed CAMP to be modular, consisting of multiple servers. The servers provide interfaces to AI models for Speech To Text (STT), a Large Language Model (LLM), and Text To Speech (TTS). The main logic is contained in the main application server, which also implements the user interface through a webpage. Everything is deployed and running on an AMD Ryzen AI PC. Below is a schematic of the CAMP system.
The user interface can be accessed on the same device via a browser, where the local microphone and speaker can be used, or remotely over the local network.
For model inference we used the processing power of the AMD Ryzen AI PC. For development we used a Minisforum Venus UM790 Pro with an AMD Ryzen 9 and Windows 11. AMD Ryzen AI processors have an integrated Neural Processing Unit (NPU), which is designed for model inference. The NPU provides additional resources that offload processing from the main CPU. At the same time, NPUs are designed to be more power efficient.
Demonstration
First, let us look at the final application. The user interface to CAMP is a webpage. When we arrive on the CAMP webpage, we are presented with a short explanation of the application. When we press the "Start CAMP" button, audio recording is enabled, and shortly after we can observe the conversation transcript produced by the STT. When CAMP detects miscommunication, it automatically speaks up with a helpful response.
Here is also a video demonstration of CAMP. For the input we used a conversation from the popular television show Friends.
At the beginning, CAMP waits for some lines of the transcript to become available. The transcript is then processed by asking the LLM whether there is miscommunication or arguing in the conversation. If the LLM's answer is 'yes', the LLM is used to generate a response to help with the conversation. After the full response is generated, the TTS model is used and the response is played through a speaker. During LLM response generation, CAMP continues to listen to and transcribe the conversation. The new transcript is analyzed in the next iteration, when the LLM is free.
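A minimal sketch of this two-stage loop could look like the following. The helper names, endpoint URLs, and prompt wording here are illustrative assumptions, not the exact code used in CAMP:

```python
import requests

# Hypothetical local endpoints; the actual ports are set in the server configs.
LLM_URL = "http://localhost:8000/chat"
TTS_URL = "http://localhost:8002/speak"

DETECT_PROMPT = ("Analyze the following conversation and answer only Yes or No: "
                 "is there any miscommunication or arguing?\n\n{transcript}")
HELP_PROMPT = ("The following conversation contains miscommunication. "
               "Provide a short, helpful response to de-escalate it.\n\n{transcript}")

def ask_llm(prompt: str) -> str:
    # Single-shot chat request to the local LLM server (OpenAI-style payload).
    resp = requests.post(LLM_URL, json={"messages": [{"role": "user", "content": prompt}]})
    return resp.json()["message"]["content"]

def camp_iteration(transcript: str) -> None:
    # Stage 1: yes/no miscommunication check.
    answer = ask_llm(DETECT_PROMPT.format(transcript=transcript))
    if answer.strip().lower().startswith("yes"):
        # Stage 2: generate a de-escalating response and hand it to TTS playback.
        response = ask_llm(HELP_PROMPT.format(transcript=transcript))
        requests.post(TTS_URL, json={"text": response})
```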
Usage
We start CAMP by pressing the "Start CAMP" button (Figure 1). Audio recording starts, and the STT provides the transcript, which is displayed in the "Conversation messages" section (Figure 2).
CAMP then periodically processes the conversation using the LLM. We can monitor the LLM operation in the "LLM messages" section (Figure 3). When miscommunication is detected, a response is generated and converted to speech using TTS. The generated speech is automatically played through the connected speaker (Figure 4).
We can reset the already processed conversation by pressing the "Reset conversation" button (Figure 5), or download the conversation transcript by pressing the "Download conversation" button (Figure 6). The latter is especially useful in meetings, where the transcript can be referenced later.
Implementation
We implemented the AI model servers in Python with FastAPI and deployed them with Uvicorn. The AI model servers listen on localhost on different ports and provide API endpoints. The developed APIs closely follow those from OpenAI. By separating the servers, we got a modular project that allows servers to be interchanged in the future and let us split the development into smaller parts.
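As an illustration, a model server in this style might look like the minimal FastAPI sketch below. The endpoint path follows the OpenAI convention; the inference call is a stub standing in for the real model:

```python
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class Message(BaseModel):
    role: str
    content: str

class ChatRequest(BaseModel):
    messages: list[Message]

def run_model(messages: list[Message]) -> str:
    # Stub: the real server runs the quantized model here.
    return "..."

@app.post("/v1/chat/completions")
def chat(request: ChatRequest):
    reply = run_model(request.messages)
    return {"choices": [{"message": {"role": "assistant", "content": reply}}]}

# Deployed with Uvicorn, e.g.: uvicorn server:app --host 127.0.0.1 --port 8000
```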
The LLM used in the CAMP project is Llama2, deployed on the AMD Ryzen AI NPU. We quantized the model using RyzenAI-SW v1.1, where we followed the transformers v1.1 and llama2 v1.1 example. The developed server code is accessible in the CAMP-llama2 repository. The server can be deployed with different variations of quantized models. Mostly, we used a 3-bit quantized model with a quantized lm_head and flash attention enabled. The model performed very well for our application. Besides batch responses, we also implemented streaming, which makes the application more responsive. In the server, we implemented both chat and generate endpoints. For the final CAMP application we used the chat endpoint with streaming.
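Consuming the streaming chat endpoint from the client side could look roughly like this. The URL and the chunk format (one JSON object per line) are assumptions based on the OpenAI-style API described above:

```python
import json
import requests

def stream_chat(messages, url="http://localhost:8000/v1/chat/completions"):
    # Request a streamed response; tokens arrive as they are generated,
    # which makes the application feel more responsive than batch mode.
    with requests.post(url, json={"messages": messages, "stream": True}, stream=True) as resp:
        for line in resp.iter_lines():
            if line:
                chunk = json.loads(line)
                yield chunk["choices"][0]["delta"].get("content", "")

# Collect the streamed tokens into the full response text.
full_response = "".join(stream_chat([{"role": "user", "content": "Hello"}]))
```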
For the STT server we used the Faster-Whisper implementation of the Whisper model. We built our own server around the Faster-Whisper library; the implementation is accessible at CAMP-SpeechToText. For the model we used distil-small.en. The transcription runs on the AMD Ryzen CPU, which is fast enough to process audio files 5-15 seconds in length recorded at 2-second intervals. This provides responses in real time.
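In its simplest form, transcribing one recorded chunk with the Faster-Whisper library looks like this (the file name and compute type are illustrative):

```python
from faster_whisper import WhisperModel

# distil-small.en on the CPU is fast enough for near real-time transcription.
model = WhisperModel("distil-small.en", device="cpu", compute_type="int8")

segments, info = model.transcribe("chunk.wav")  # one 5-15 s recording
transcript = " ".join(segment.text.strip() for segment in segments)
print(transcript)
```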
The TTS library used is pyttsx3. We created our own server to expose the library's capabilities. The model runs on the AMD Ryzen CPU, which provides very fast responses. The developed server is accessible at CAMP-TextToSpeech.
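The core of such a server is only a few lines around pyttsx3. A sketch of synthesizing speech to a file so it can be returned over the API (the function and file names are illustrative):

```python
import pyttsx3

def synthesize(text: str, out_path: str = "response.wav") -> str:
    # pyttsx3 wraps the platform's native TTS engine (SAPI5 on Windows),
    # so synthesis on the AMD Ryzen CPU is very fast.
    engine = pyttsx3.init()
    engine.save_to_file(text, out_path)
    engine.runAndWait()
    return out_path
```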
To combine everything together and develop the CAMP application, we used Streamlit, a Python framework for quick development of front-end applications. The downside of Streamlit is that it is not an asynchronous framework, but we managed to work around this by initializing an additional thread that runs the tasks making requests to the AI model servers.

We also had to develop our own recording front-end component, as the existing ones do not provide streaming recordings. As the basis for our component we used B4PT0R/streamlit-mic-recorder. We modified it to provide overlapping audio recordings of 5-15 seconds, taken every 2 seconds. The component is paused while the audio playback of a response is playing. The overlapping audio recordings enhance the Whisper transcription, as transcription works best in the middle of a recording. We then developed a component to calculate the overlap and stitch the transcripts together, as sketched below. This process also reduced issues with Whisper when there is no speech in the audio recording.

The CAMP application then checks the transcript when enough new content has been received or there has been a pause in the conversation. The check is done by sending the conversation to the LLM server. The transcript is extended with engineered prompts to get predictable results. In the first stage, the LLM is asked to analyze the conversation and answer Yes or No whether there is any miscommunication or arguing. If something is detected, the LLM is used to generate a response, which is then sent to the TTS and played on the speaker through the browser.
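A simplified word-level version of the transcript stitching could look like this; the actual component also has to handle timing and empty-speech edge cases:

```python
def stitch(accumulated: str, new_chunk: str) -> str:
    # Find the longest word-level overlap where the end of the accumulated
    # transcript matches the beginning of the new chunk's transcript.
    acc_words, new_words = accumulated.split(), new_chunk.split()
    max_overlap = min(len(acc_words), len(new_words))
    for size in range(max_overlap, 0, -1):
        if acc_words[-size:] == new_words[:size]:
            return " ".join(acc_words + new_words[size:])
    # No overlap found: append the whole new chunk.
    return " ".join(acc_words + new_words)

# Example: the shared words "about the" are detected and merged once.
print(stitch("we should talk about the", "about the budget for next week"))
# -> "we should talk about the budget for next week"
```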
The final deployment schematic is presented below.
The AI model servers run on localhost and listen on separate ports, while the Streamlit CAMP application can be accessed on the same machine through localhost or over the local network from a different PC.
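As an example, the endpoint layout might look as follows; the port numbers here are illustrative, not the project's actual defaults:

```python
# Illustrative port layout; each AI model server binds to localhost only.
ENDPOINTS = {
    "llm": "http://127.0.0.1:8000",  # CAMP-llama2 (FastAPI + Uvicorn)
    "stt": "http://127.0.0.1:8001",  # CAMP-SpeechToText
    "tts": "http://127.0.0.1:8002",  # CAMP-TextToSpeech
}
# The Streamlit app itself is also reachable from other PCs on the local network.
```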
Development
As already stated, during development we benefited from our design choice of separating the models into their own servers, even though they all run on the same machine.
We developed many test applications to evaluate how the different components work. Some of these tests were included in the final application.
For the prompt engineering of the LLM we used conversations from the already mentioned TV series Friends, taken from the character-mining dataset. Pressing the "Dataset prompt" button (Figure 1) adds one dialogue between the series' characters to the conversation. This is then processed by CAMP, and if miscommunication is detected, CAMP provides and plays a response.
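A sketch of how such a dialogue could be pulled from the character-mining JSON files. The field names here assume the layout of that dataset's release and may differ from the actual files:

```python
import json
import random

def random_dialog(path: str = "friends_season_01.json") -> str:
    # Assumed layout: season -> episodes -> scenes -> utterances,
    # where each utterance has "speakers" and "transcript" fields.
    with open(path, encoding="utf-8") as f:
        season = json.load(f)
    episode = random.choice(season["episodes"])
    scene = random.choice(episode["scenes"])
    return "\n".join(
        f"{', '.join(u['speakers'])}: {u['transcript']}"
        for u in scene["utterances"]
    )
```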
Another test included in the final application is the "TTS test" (Figure 2), which sends a test prompt to the TTS server and then plays back the generated speech.
Installation
The final CAMP project is composed of 4 repositories, heavily relying on AMD RyzenAI-SW v1.1 for the quantization and deployment of the Llama2 model:
- CAMP: Main application based on Streamlit
- CAMP-llama2: LLM Llama2 server using AMD Ryzen AI NPU
- CAMP-SpeechToText: STT Whisper server
- CAMP-TextToSpeech: TTS pyttsx3 server
To deploy the CAMP system, download the repositories and follow the instructions provided for each of the servers in their READMEs. We suggest that you create a folder where you download all of the repositories, then deploy the AI model servers first and the main CAMP application at the end. For the deployment you will need Python. Additionally, you will need the Ryzen AI v1.1 NPU driver and RyzenAI-SW v1.1. Follow the AMD tutorials v1.1 to prepare the Ryzen AI environment and quantize the Llama2 v1.1 model. Some additional help can also be found in the CAMP-llama2 README.
Conclusion and future ideas
In this project we successfully implemented our initial idea for CAMP as a conversation assistant for miscommunication prevention. We closely followed our proposal for the AMD Pervasive AI Developer Contest. We combined AI models for STT, LLM, and TTS to provide conversation capabilities and reasoning. All of the models and the main application were deployed on the AMD Ryzen AI PC, also leveraging the integrated NPU. This offloads work from the CPU and reduces power consumption. We also observed lower memory usage compared to when the model is deployed on the CPU, especially at idle, when the model is loaded but not used.
The CAMP system performed very well during testing. There are still some optimizations to be done to reduce delays and provide faster responses, but for miscommunication prevention this is not as much of a problem. The biggest delay in the system occurs when the application waits for the whole response to be generated, even though we receive it with streaming, as we were unable to extend Streamlit to do streaming audio playback. This is one of the TODOs for the future.
We also experimented with deploying the Llama3 and Whisper models on the NPU but were unfortunately not fully successful. This is another thing we will investigate in the future. We also plan to extend CAMP's capabilities further, for example by providing conversation summaries or similar features.
As we have mostly worked in the embedded world, we gained a lot of new knowledge about AI models, inference, Python, and front-end development.