In this project we will design a multimodal, multi-agent conversational AI assistant that can help with cooking chores such as providing recipes, explaining how to pair food ingredients, and answering dietary questions. We will leverage middleware such as Whisper, Piper, Ollama, LangChain, CrewAI and multimodal LLMs to implement a fully local chatbot with vision and speech capabilities.
Preparing the hardware
A Jetson AGX Orin was used for this project, courtesy of NVIDIA. The board has 64GB of unified memory shared between the CPU and the integrated GPU. It also comes with 64GB of eMMC storage, which holds the JetPack 5.1 OS based on Ubuntu.
When using the system for the first time, one quickly finds that an external SSD is essential due to the limited onboard space, so a 1TB NVMe SSD was installed. The next step was connecting a USB speakerphone and a USB camera. That is all the hardware required for the project.
Preparing the SW environment
First, update JetPack to the latest packages:
sudo apt update
sudo apt dist-upgrade
sudo reboot
sudo apt install nvidia-jetpack
Preparing the external SSD
sudo mkfs.ext4 /dev/nvme0n1p3
sudo mkdir /agxorin_ssd
sudo mount /dev/nvme0n1p3 /agxorin_ssd
sudo chmod 755 /agxorin_ssd
To mount the drive automatically at boot, add an entry to /etc/fstab. The drive UUID is found by issuing:
ls -lha /dev/disk/by-uuid
sudo nano /etc/fstab
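A typical entry looks like the following (the UUID shown is a placeholder; use the one reported for your drive):
UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx /agxorin_ssd ext4 defaults 0 2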
sudo umount /agxorin_ssd
sudo mount -a
Next, change the Docker data root to point to the NVMe drive so that Docker images are stored there.
sudo mkdir /agxorin_ssd/docker-data
sudo nano /etc/docker/daemon.json
and add
{
  "runtimes": {
    "nvidia": {
      "path": "/usr/bin/nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia",
  "data-root": "/agxorin_ssd/docker-data"
}
Restart the Docker service with sudo systemctl restart docker, then verify the new data root by issuing:
docker info -f '{{ .DockerRootDir}}'
As an optional add-on, you can also enable some swap:
sudo systemctl disable nvzramconfig
sudo fallocate -l 16G /agxorin_ssd/16GB.swap
sudo mkswap /agxorin_ssd/16GB.swap
sudo swapon /agxorin_ssd/16GB.swap
At this point the user can proceed with the installation of the docker containers and main software stack.
Software stack overview
The local AI chatbot is implemented as a distributed voice chat application that runs entirely on the AGX Jetson Orin. The components are fully open source and there is no dependence on cloud services.
The cooking assistant uses Ollama to serve Mistral, OpenHermes and the LLaVA 13-billion-parameter model as the main LLMs. The Piper TTS engine converts the LLM responses to speech, and Faster-Whisper provides near real-time voice transcription.
The application is composed of the following modules:
1. Voice transcription application that converts speech audio frames into text.
2. Multimodal LLaVA 13-billion-parameter LLM with vision capabilities that observes the surrounding environment and uses it to augment the system prompt.
3. Main Mistral LLM used for conversing with the user.
4. CrewAI-based research module that takes task commands from the orchestrator LLM (OpenHermes).
5. Text-to-speech service using Piper TTS, which converts text into speech.
6. RAG capability for adding obscure recipes to avoid LLM hallucination; this is a bonus add-on that still needs to be integrated.
Voice Transcription service
The Faster-Whisper Docker recipe from jetson-containers was used to build a custom Docker image with Faster-Whisper and Jupyter Notebook. Compared to the vanilla Whisper implementation, the Faster-Whisper library provides 2-4x faster transcription, which approaches real-time conversational latency. The PyAudio library was used to set up the microphone with a mono channel sampled at 16kHz. To detect whether speech is present, the Python webrtcvad library was used to add Voice Activity Detection. This essentially optimizes sampling by only processing audio frames when the VAD reports speech at a programmable aggressiveness level.
./build.sh --name=voiceai faster-whisper jupyterlab jetson-utils
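The VAD-gated capture described above can be sketched as follows (the frame length and silence threshold are illustrative, not the exact values used in the app):

import pyaudio
import webrtcvad

RATE = 16000                                   # 16 kHz mono capture
FRAME_MS = 30                                  # webrtcvad accepts 10, 20 or 30 ms frames
SAMPLES_PER_FRAME = RATE * FRAME_MS // 1000    # 480 samples per frame

vad = webrtcvad.Vad(2)                         # aggressiveness level 0-3
pa = pyaudio.PyAudio()
stream = pa.open(format=pyaudio.paInt16, channels=1, rate=RATE,
                 input=True, frames_per_buffer=SAMPLES_PER_FRAME)

def record_utterance(max_silence_frames=20):
    """Collect audio while the VAD reports speech; stop after a run of silent frames."""
    voiced, silence = [], 0
    while True:
        frame = stream.read(SAMPLES_PER_FRAME, exception_on_overflow=False)
        if vad.is_speech(frame, RATE):
            voiced.append(frame)
            silence = 0
        elif voiced:
            silence += 1
            if silence > max_silence_frames:
                return b"".join(voiced)        # raw 16-bit PCM of the utterance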
The Python asyncio library was used to run the transcription task. A WebSocket client was implemented that takes the transcribed text and sends it to a WebSocket server. Experiments showed that the quantized model variants degrade transcription accuracy, hence the float16 medium.en Whisper model was used. The voice activity detection settings are also important, as one can sometimes observe repetition of the utterance.
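The transcription-and-forward step then looks roughly like this, where pcm_bytes is the raw utterance returned by a capture loop like the one above (the server address and plain-text message format are assumptions):

import asyncio
import numpy as np
import websockets
from faster_whisper import WhisperModel

model = WhisperModel("medium.en", device="cuda", compute_type="float16")

async def transcribe_and_send(pcm_bytes, uri="ws://192.168.1.50:8765"):
    # faster-whisper expects float32 audio in [-1, 1]
    audio = np.frombuffer(pcm_bytes, dtype=np.int16).astype(np.float32) / 32768.0
    segments, _ = model.transcribe(audio, beam_size=5, language="en")
    text = " ".join(seg.text.strip() for seg in segments)
    if text:
        async with websockets.connect(uri) as ws:
            await ws.send(text)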
To run the Docker image that was built, issue:
./run.sh --volume /agxorin_ssd/jetson-containers/apps:/home $(./autotag voiceai)
The folder contains a shell script with all the required dependencies that need to be installed in the Docker container. The other option is to modify the Docker recipe to include those dependencies so they are not reinstalled on every run.
sudo apt-get update
sudo apt-get install libasound-dev
sudo apt-get install libportaudio2
sudo apt install portaudio19-dev
sudo apt install ffmpeg wget -y
sudo pip install pyaudio
sudo pip install webrtc
sudo pip install webrtcvad
sudo pip install websockets
The transcription application implements the first part of the conversational AI pipeline. The use of WebSockets was deemed more flexible than gRPC.
The medium.en model was used to balance latency and accuracy. Note that the delay from utterance to transcription depends on the model size, beam size and float format. A beam size of 5 was used. Multiple tests show that the faster-whisper model sometimes also hallucinates, as described in the GitHub issue at [1].
Text to voice service
The text-to-speech engine was initially implemented with the pyttsx3 library using the espeak-ng backend. That approach works, however it produces an unnatural intonation. The following libraries need to be installed:
sudo apt update
sudo apt install espeak ffmpeg libespeak1
sudo apt-get install alsa-utils
To address this, the Piper TTS library was used, which implements a real-time text-to-speech engine that can be customized with various languages and voice tones. The Piper TTS engine also uses espeak-ng under the hood.
pip install piper-tts
You can preview the available voices here:
https://rhasspy.github.io/piper-samples/
The two voices downloaded from Hugging Face were Alba and Amy. I found Alba's (Scottish English) voice the most pleasant one to listen to.
https://huggingface.co/rhasspy/piper-voices/tree/main/en/en_US/amy/medium
https://huggingface.co/rhasspy/piper-voices/tree/main/en/en_GB/alba/medium
Both are medium-fidelity voice models, which offer the best tradeoff between latency and quality. For each voice model the .onnx model, the model card and the JSON config file need to be present in the app directory.
en_GB-alba-medium.onnx
en_GB-alba-medium.onnx.json
en_en_GB_alba_medium_MODEL_CARD
The voices are switched on the fly by loading the corresponding .onnx file.
async def play_AlbaVoice(text, device_index=0):
    # Load the Alba voice model (.onnx) and narrate the given text via the app's Piper engine wrapper
    engine.load('en_GB-alba-medium.onnx')
    engine.say(text)
The text-to-speech application implements a WebSocket server. It starts the Piper TTS engine and waits for transcribed text from the Docker transcription service that runs the faster-whisper engine. In addition, the script implements filtering to remove intonation characters that slow down the natural speech flow. Once the transcribed query is received from the WebSocket client running in the Docker container, it is sent to the main LLM via a request. The reply from the async chat client is then passed directly to the Piper engine. The demo below shows testing of the text-to-speech service.
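In outline, the narration side of this WebSocket server can be sketched as follows; here the incoming text is narrated directly, whereas the real app first forwards it to the LLM (the port, voice file location and output device are assumptions):

import asyncio
import subprocess
import websockets

VOICE = "en_GB-alba-medium.onnx"   # assumed to sit next to its .json config

def narrate(text, voice=VOICE):
    """Synthesize text with the Piper CLI and play the result through ALSA."""
    subprocess.run(["piper", "--model", voice, "--output_file", "reply.wav"],
                   input=text.encode("utf-8"), check=True)
    subprocess.run(["aplay", "reply.wav"], check=True)

async def handler(websocket, path=None):      # path kept for older websockets releases
    async for text in websocket:              # each message is one transcribed query
        narrate(text)

async def main():
    async with websockets.serve(handler, "0.0.0.0", 8765):
        await asyncio.Future()                # run forever

asyncio.run(main())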
LLM server
The Ollama framework was used to interface with multiple LLM models. This is a Golang middleware application that makes it easy to instantiate LLM models behind a REST API service. Initially I installed Ollama on the AGX Jetson Orin by issuing:
curl https://ollama.ai/install.sh | sh
Due to the relatively slow token output, it immediately became clear that everything was running on the 12 CPU cores and nothing was running on the GPU. After opening an inquiry in this thread [2], it turned out that Ollama required a patch to support offloading to the GPU on NVIDIA Jetson devices.
GoLang Installation
wget https://go.dev/dl/go1.21.6.linux-arm64.tar.gz
sudo tar -C /usr/local -xzf go1.21.6.linux-arm64.tar.gz
export PATH=$PATH:/usr/local/go/bin
source ~/.bashrc
go version
The Ollama application is not currently offered as a Docker container that runs on the GPU, so it was installed on the eMMC. The models, however, are fairly large, so they need to be moved to the SSD.
Build Ollama from the third-party development repo:
git clone --depth=1 --recursive https://github.com/remy415/ollama.git
export OLLAMA_LLM_LIBRARY='cuda_v11'
export OLLAMA_SKIP_CPU_GENERATE='yes'
go generate ./...
go build .
Move the model folder to the SSD and create a symlink pointing at the new location.
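For example, assuming the models live under ~/.ollama/models (the default for a user-level install; alternatively the OLLAMA_MODELS environment variable can point straight at the SSD):
mv ~/.ollama/models /agxorin_ssd/ollama-models
ln -s /agxorin_ssd/ollama-models ~/.ollama/models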
Start the Ollama server.
sudo ollama serve
Ollama allows the user to customize models via a Modelfile. This feature is crucial for this application, as we will ask the LLM to role-play as an acclaimed chef.
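A minimal example of such a Modelfile; the system prompt and parameters shown here are illustrative rather than the exact ones used in the project:

FROM mistral
PARAMETER num_ctx 8000
PARAMETER temperature 0.7
SYSTEM """You are an acclaimed chef. Suggest recipes, explain how to pair ingredients and answer dietary questions in a friendly conversational tone that reads well aloud."""

The customized model is then built with something like ollama create chef-mistral -f Modelfile.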
Download the models
For this project, the following models were used:
- Mistral
- OpenHermes
- Llava:13b
ollama pull mistral
ollama pull openhermes
ollama pull llava:13b
The main application uses a Conda environment with at least Python 3.10:
conda info --envs
conda activate aibase
At this point we can test the end-to-end pipeline by having the ASR app running in one terminal and the text-to-speech App_TTS with the Ollama LLM in another terminal.
python3 mainApp_ttsollama.py
If needed, you can use a Python WebSocket client in the same App_TTS directory to test the server.
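Something along these lines works as a quick check (the server address and port are assumptions):

import asyncio
import websockets

async def send_test_query(uri="ws://localhost:8765"):
    async with websockets.connect(uri) as ws:
        await ws.send("What goes well with roasted butternut squash?")

asyncio.run(send_test_query())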
Main Application Demo Pipeline
The system block diagram below shows the complete multimodal pipeline of the conversational chatbot. You can see that the system makes use of multiple LLM models. The first demo below shows how to communicate with the LLM. Note that the current implementation does not support speaker diarization, so only one speaker is supported at a time.
The main branch is composed of a Dockerized Python WebSocket client app that uses the faster-whisper network to convert utterances to text. The main LLM used for communicating with the chatbot is a customized version of Mistral. The modified Ollama Modelfile is shown below.
The transcribed queries are sent to the LLM using asynchronous requests. The LLM replies are received asynchronously and passed to the Piper text-to-speech task. The Mistral LLM speaks with the female Alba voice.
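A stripped-down sketch of such an asynchronous request against the Ollama REST API, assuming the aiohttp client library and a customized model named chef-mistral (a hypothetical name):

import aiohttp

OLLAMA_URL = "http://localhost:11434/api/generate"

async def ask_llm(prompt, model="chef-mistral"):
    payload = {"model": model, "prompt": prompt, "stream": False}
    async with aiohttp.ClientSession() as session:
        async with session.post(OLLAMA_URL, json=payload) as resp:
            data = await resp.json()
            return data["response"]          # the full reply when streaming is disabled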
Simply by using the bottom branch, one can effectively converse with a local LLM with no dependency on cloud services. Apart from the system prompt in the Modelfile, however, the Mistral LLM lacks any context about its surroundings. To address this, we have to add camera feedback to be used as context for the AI agents.
Adding agentic capabilities with vision and speech
A new branch of the pipeline was implemented to give the LLM the ability to observe its surroundings and use that as context, in conjunction with the voice queries, for the main LLM that responds to the user's tasks.
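In sketch form, a camera frame can be captured and described with LLaVA through the same Ollama API (assuming OpenCV is available; the camera index and prompt wording are placeholders):

import base64
import cv2
import requests

def describe_scene(camera_index=0):
    """Grab one frame from the USB camera and ask llava:13b to describe it."""
    cap = cv2.VideoCapture(camera_index)
    ok, frame = cap.read()
    cap.release()
    if not ok:
        return ""
    _, jpeg = cv2.imencode(".jpg", frame)
    payload = {
        "model": "llava:13b",
        "prompt": "Describe the food items visible in this image.",
        "images": [base64.b64encode(jpeg.tobytes()).decode("utf-8")],
        "stream": False,
    }
    resp = requests.post("http://localhost:11434/api/generate", json=payload)
    return resp.json()["response"]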
To imbue the chatbot with an agentic character, the crewAI package was used. The program uses the OpenHermes LLM for both the researcher and the writer agent; it is also possible to use a different LLM for each agent or the same one for all. The only downside at the moment is the additional latency from swapping LLMs in and out of VRAM, since the models run sequentially. None of the frameworks currently support running concurrent LLM models on the same GPU, even if both LLMs fit in VRAM. To support agentic behavior we'll have to install some additional libraries.
conda create -n aibase python=3.10
conda activate aibase
pip install open-interpreter
pip install crewai
pip install unstructured
pip install langchain
pip install "Jinja2>=3.1.2"
pip install "click>=7.0"
pip install duckduckgo-search
pip install --upgrade --quiet duckduckgo-search
The CrewAI framework has similar functionality to Autogen but is more flexible. The package allows multiple LLM agents to use tools such as internet search engines to accomplish tasks. The user needs to provide a context and a task for each agent. In this case the context is provided by the vision branch, which receives a description of the camera frames from the LLaVA 13-billion model. The steps to get this working are:
1. Start Ollama server
sudo ollama serve
2. Start the faster-whisper Docker container with the speech recognition app, using the Git repo located under jetson-containers/apps/:
./run.sh --volume /agxorin_ssd/jetson-containers/apps/:/home $(./autotag voiceai)
./installaud.sh
python3 main.py
3. Start the WebSocket server with crewAI, which combines the vision context with the vocal query, in a separate terminal:
conda activate aibase
cd App_VisionCrewAI
python3 app.py
Once a voice query is received, the camera snaps a photo of the surroundings and passes it to Llava:13B, whose output is used as context for the crewAI agent crew, together with the transcribed vocal query as the task. The findings of the crewAI team are narrated by the selected TTS engine.
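A condensed sketch of the agent setup is shown below; the roles, goals and exact crewAI/LangChain argument names are illustrative and may differ between library versions:

from crewai import Agent, Task, Crew
from langchain_community.llms import Ollama
from langchain_community.tools import DuckDuckGoSearchRun

llm = Ollama(model="openhermes")      # agent LLM served locally by Ollama
search = DuckDuckGoSearchRun()        # internet search tool

researcher = Agent(
    role="Cooking researcher",
    goal="Find recipes and ingredient pairings relevant to the user's request",
    backstory="An experienced chef who researches dishes online.",
    llm=llm,
    tools=[search],
)
writer = Agent(
    role="Cooking assistant",
    goal="Summarize the findings into a short answer suitable for narration",
    backstory="A friendly kitchen assistant.",
    llm=llm,
)

def run_crew(vision_context, query):
    research = Task(
        description=f"Scene description: {vision_context}\nUser request: {query}",
        expected_output="Key findings, recipes or pairings that answer the request.",
        agent=researcher,
    )
    reply = Task(
        description="Write the final reply to be read aloud to the user.",
        expected_output="Two or three conversational sentences.",
        agent=writer,
    )
    return Crew(agents=[researcher, writer], tasks=[research, reply]).kickoff()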
The demo below shows the complete multimodal AI with voice, vision and reasoning capabilities using multiple LLMs. The proper audio output device has to be selected manually, as Ubuntu does not do this automatically.
Since Ollama on the AGX Orin can only run LLMs sequentially, if one desires to converse with multiple agents, the top branch that implements the video-to-CrewAI-to-text-to-speech pipeline can be run on a separate AGX or a remote local GPU.
The short summary of the pipeline is as follows:
- The utterance is converted to text and received as a WebSocket message on the WS server.
- Once a voice message is received, a camera frame is captured and passed to Llava.
- The camera frame description is used as context.
- The transcribed query message is used as the task.
- Both task and context are passed to the CrewAI agents, which use DuckDuckGo for search.
- The report from the cooking chef and the cooking assistant is narrated by Amy.
- This pipeline runs independently of the main LLM and can run on another AGX Orin.
1. When the Modelfiles were edited, the context length was set to the maximum, which is 8000 tokens for Mistral. The models have no memory of their own, so they don't remember chat history beyond that context. One option is to use LangChain's RunnableWithMessageHistory or a third-party library like MemGPT.
2. The ASR pipeline can be augmented with speaker diarization to route multiple speakers to different LLMs. The ASR latency and the hallucinations of faster-whisper are the main bottlenecks of local chatbots.
3. Currently Ollama and the NVIDIA GPU stack do not support loading multiple LLM models resident in VRAM and switching dynamically between them.
4. Open LLMs do not have good support for function calling. Mixtral looks to be the best, however it is also the largest model at 22GB.
5. All the main frameworks used in this project (Ollama, CrewAI and LangChain) are still under active development.
Appendix 1. - Adding RAG capability
LLMs tend to confabulate when they lack the requested information from their training data. For example, if you ask the main Mistral LLM for a meal recipe on which it was not trained with proper data, it will make something up. To avoid that, Retrieval Augmented Generation (RAG) is used: the data (in this case the recipes) is stored in a vector database, and the LLM reads from the database by doing a similarity search.
For this implementation, the ChromaDB database and the nomic-embed-text embedding model were used to build the embeddings from recipe documents on a website. You can pick your favorite cooking website; the approach is generic, so it works with any article.
git clone https://github.com/Q-point/Ollama_RAG_web
pip install langchain
pip install langchain-community
pip install gradio
python app.py
The app.py script in the repo above uses the same pipeline to answer recipe queries with replies grounded in the RAG data. This functionality currently operates independently of the main pipeline.
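As a rough outline of this kind of pipeline, assuming chromadb and the LangChain packages above are installed and nomic-embed-text has been pulled with Ollama (the URL and chunking parameters are placeholders, and the actual code in the repo may differ):

from langchain_community.document_loaders import WebBaseLoader
from langchain_community.embeddings import OllamaEmbeddings
from langchain_community.llms import Ollama
from langchain_community.vectorstores import Chroma
from langchain.text_splitter import RecursiveCharacterTextSplitter

# Load a recipe page and split it into overlapping chunks
docs = WebBaseLoader("https://example.com/obscure-recipe").load()
chunks = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)

# Embed the chunks with nomic-embed-text served by Ollama and store them in Chroma
store = Chroma.from_documents(chunks, OllamaEmbeddings(model="nomic-embed-text"))
retriever = store.as_retriever()

def answer(question):
    """Retrieve the most relevant chunks and let Mistral answer from them only."""
    context = "\n\n".join(d.page_content for d in retriever.invoke(question))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return Ollama(model="mistral").invoke(prompt)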
Appendix 2. - Adding visual feedback
A Python Flask application that uses the LLM to describe what it sees was implemented as an add-on, but it has not been integrated with the main app.
One future addition is a third pipeline, together with a web app, that generates images based on the provided response using Stable Diffusion. The photo for this project was created using Stable Diffusion XL.
CONTAINERS_DIR=/agxorin_ssd/jetson-containers
MODEL_DIR=$CONTAINERS_DIR/data/models/stable-diffusion/models/Stable-diffusion/
sudo chown -R $USER $MODEL_DIR
wget -P $MODEL_DIR https://huggingface.co/stabilityai/stable-diffusion-xl-base-1.0/resolve/main/sd_xl_base_1.0.safetensors
wget -P $MODEL_DIR https://huggingface.co/stabilityai/stable-diffusion-xl-refiner-1.0/resolve/main/sd_xl_refiner_1.0.safetensors
./run.sh $(./autotag stable-diffusion-webui)
The image was generated using the prompt "A cute ultra realistic llama looking at an apple pie".
Summary
In this project, a fully local multimodal multi-agent chatbot with conversational capabilities was designed as a distributed application. The chatbot allows the user to converse in natural speech with LLM agents via local ASR models. In addition, the chatbot surveys its environment with a camera and uses a visual description of the frame to assist the LLM agents by merging the visual prompt with the speech queries.
Imbuing the chatbot with an agentic character that leverages tools to search the internet or call other APIs increases its effectiveness. Finally, the outputs from each agent assembly are passed to a text-to-speech engine to communicate the findings and replies to the user. This project makes use of multiple state-of-the-art libraries which are continuously evolving as of March 2024.