Large language models (LLMs) are natural language processing programs that use artificial neural networks to generate text. They can analyze the nuances of human language, including syntax, semantics, and contextual meaning. With a well-designed prompt (i.e. a user-supplied instruction), an LLM can be directed to understand an intended task and guide the execution of additional AI inference models to achieve a desired result.
For example, we could engage an LLM in a chat-style interface to perform a task by submitting a prompt like "Count the number of zebras in this image". The LLM may not possess the ability to perform computer vision-based inference on its own, but it could infer our intent to:
1.) Recognize the intended task, i.e. object detection (accomplished by exposing the LLM beforehand to a variety of example prompts that describe the tasks we are interested in)
2.) Determine a suitable AI model to execute for a recognized task (accomplished by tagging available models based on their characteristics and scoring them as potential candidates for a given task)
3.) Direct execution of the determined AI inference task using parameters extracted during the task recognition / planning step (adhering to any specifications a given model imposes on acceptable input)
4.) Reason over the resulting text output from the AI inference task to provide additional context for addressing the original question (we can tell the LLM what the additional AI task concluded, which may now allow it to answer the original question)
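To make this flow concrete, here is a minimal, purely illustrative Python sketch of the four steps. The function names, the keyword matching, and the hard-coded model and result values are placeholders for this article and are not taken from the JARVIS source:
# Conceptual sketch of the four orchestration steps above (not JARVIS code).
# All names and values below are illustrative placeholders.

def recognize_task(prompt: str) -> str:
    # Step 1: map the user's request to a known task type
    return "object-detection" if "count" in prompt.lower() else "unknown"

def select_model(task: str) -> str:
    # Step 2: choose a candidate model tagged as suitable for the task
    candidates = {"object-detection": "facebook/detr-resnet-50"}
    return candidates.get(task, "no-model")

def run_inference(model: str, image_path: str) -> str:
    # Step 3: execute the chosen model with the extracted parameters (stubbed here)
    return f"{model} reported 3 zebras in {image_path}"

def answer_question(prompt: str, inference_result: str) -> str:
    # Step 4: hand the inference result back to the LLM so it can answer
    return f"Using the detection output ({inference_result}), the answer is: 3 zebras."

if __name__ == "__main__":
    user_prompt = "Count the number of zebras in this image"
    task = recognize_task(user_prompt)
    model = select_model(task)
    result = run_inference(model, "/examples/a.jpg")
    print(answer_question(user_prompt, result))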
This process of orchestrating execution of multiple AI models to achieve a desired result is known as Multimodal AI. In this article, we will explore the possibilities of Multimodal AI by running the open source JARVIS project from Microsoft on an NVIDIA Jetson Orin embedded device.
To reproduce the content in this article, we recommend using a 64 GB Jetson Orin device equipped with a 1 TB or greater NVMe storage module.
This small form-factor device will allow us to consult a privately hosted instance of ChatGPT on Azure OpenAI Service to orchestrate various AI tasks performed locally (including generative AI-based video / image generation).
These capabilities include generating completely new images based on the pose structure of a base image, producing video and audio from text descriptions, and answering questions like "Count the number of zebras in this image".
Note: It is highly recommended to reproduce this content using an NVIDIA Jetson Orin 64 GB developer kit. The 32 GB model can sufficiently run most tasks in a standard configuration but is likely to experience out-of-memory errors when running in a full configuration. If you are attempting this on more resource-constrained Jetson embedded devices (Jetson Nano, TX2, Orin NX, etc.), it is advisable to proceed with a lite configuration as described in the sections below.
Introduction to Microsoft JARVIS
Microsoft JARVIS demonstrates that natural language can serve as an interface for LLMs to connect numerous AI models together to solve complicated AI tasks. The project creators have released an in-depth research article on arxiv.org that covers this methodology in detail. The JARVIS project involves a number of technical components to accomplish its goals; we'll talk about those here in detail and hopefully leave you interested in taking it for a spin!
The application makes use of a server component, defined in server/models_server.py, which serves locally downloaded or remotely hosted models from huggingface.co. The server's behavior is controlled by an associated configuration file (which we will set up in the next section) that determines which models are available. Model availability is based on a combination of the inference_mode parameter (local: all tasks are run locally; huggingface: all tasks are executed as web requests to huggingface.co; hybrid: a combination of the first two options) and the local_deployment parameter (minimal, standard, or full: each option gives access to a larger set of possible models but requires more resources as you approach full). This component ultimately processes web requests to initiate model execution on the NVIDIA Jetson device. Getting familiar with this particular part of the application will give you clarity on which models you can expect to be available for a given configuration, and it is a great entry point for extending JARVIS to support new models.
The server component is paired with a web app that provides a web-based, chat-style user interface. This app has its own configuration file for determining the route to the aforementioned models server component (defined in HUGGINGGPT_BASE_URL). The web interface passes requests to an intermediary component, awesome_chat.py, which translates input into model selection requests to the models server and delivers the results back to the web app in a format that allows it to properly render responses containing images, video, text, etc.
These three components ultimately drive the innovative multimodal chat experience that you see in the demo gifs. As you progress through this article, we will cover how to set up and execute each of these components and show you how to get a working JARVIS instance, configured to interface with an LLM via ChatGPT on Azure OpenAI Service, all running on an NVIDIA Jetson Orin 64 GB developer kit.
For this section, it is assumed that you have already set up and flashed a recent JetPack OS to your NVIDIA Jetson embedded device. It is recommended to work with the Jetson device via a keyboard / mouse / monitor setup, but you are welcome to do these things over a remote terminal if you feel comfortable there.
From a terminal on your Jetson device, you'll want to perform the following to grab the source files for Microsoft JARVIS and begin the process of downloading the huggingface models to the device:
sudo apt update
sudo apt install git-lfs
cd ~
mkdir src
cd src
git clone https://github.com/microsoft/JARVIS.git
cd JARVIS/server/models
bash download.sh
Note: this process could take a while depending on internet speed. At the time of writing, 28 models are included in JARVIS/server/models/download.sh and have a total size of 266 GB on disk! For this reason, it is highly recommended to configure your Jetson device to use a suitable-capacity NVMe-based storage, as the default eMMC provides only 64 GB of storage.
Once the models have completed downloading, we are ready to begin modifying the model server configuration. To do this, use your favorite text editor to take a look at the configuration file in ~/src/JARVIS/server/configs/config.default.yaml. We'll need to obtain values for the azure/api_key, azure/base_url, azure/deployment_name, and huggingface/token parameters.
Note: There are other config examples in this folder that may be more suitable to your preference, but for the purposes of this article, we assume that you are modifying the file at ~/src/JARVIS/server/configs/config.default.yaml
To supply the parameters for the azure section of the config, we'll need to deploy an instance of Azure OpenAI Service into an active Azure subscription. You may be able to proceed with a free Azure trial account if you do not have one already.
Follow the instructions in How-to - Create a resource and deploy a model using Azure OpenAI Service to set up your Azure OpenAI Service. In the steps to "Deploy a model", ensure that you deploy the gpt-35-turbo model ID (this step is important!) and take note of the name used for the deployment, as we will use it in the next steps.
Once the model has been deployed, navigate to your Azure OpenAI Service resource in the Azure portal by searching for "Azure OpenAI" and choosing the recently deployed service.
On the resulting screen, select "Keys and Endpoint" along the left-hand side to view your Access Keys and Endpoint. You will need these values readily available for the next steps.
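Before wiring these values into JARVIS, you can optionally verify that your deployment responds. The following is a minimal sketch (not part of JARVIS) that assumes the pre-1.0 openai Python package; the placeholder values are yours to fill in with the endpoint, key, and deployment name noted above:
# Optional sanity check of Azure OpenAI credentials (assumes pip install "openai<1.0")
import openai

openai.api_type = "azure"
openai.api_base = "https://YOUR_RESOURCE_NAME.openai.azure.com/"  # your Endpoint value
openai.api_key = "YOUR_AZURE_API_KEY"                             # one of your Access Keys
openai.api_version = "2023-05-15"

response = openai.ChatCompletion.create(
    engine="YOUR_DEPLOYMENT_NAME",  # the gpt-35-turbo deployment name you noted earlier
    messages=[{"role": "user", "content": "Say hello"}],
)
print(response.choices[0].message["content"])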
Now that we've configured the Azure OpenAI Service, we need to obtain an API token for huggingface.co. Follow the steps in this article, specifically the section "Get your API Token", and keep this value on hand for the next step.
Next, on the Jetson device, open the file at ~/src/JARVIS/server/configs/config.default.yaml in your favorite text editor. Once opened, update the following settings as shown (leave everything else as is):
# openai:
#   api_key: REPLACE_WITH_YOUR_OPENAI_API_KEY_HERE
azure:
  api_key: REPLACE_WITH_YOUR_AZURE_API_KEY_HERE
  base_url: REPLACE_WITH_YOUR_ENDPOINT_HERE
  deployment_name: REPLACE_WITH_YOUR_DEPLOYMENT_NAME_HERE
huggingface:
  token: REPLACE_WITH_YOUR_HUGGINGFACE_TOKEN_HERE
model: gpt-3.5-turbo
inference_mode: local
local_deployment: full
Note: We are commenting out the openai settings (addition of #) and uncommenting the azure configuration section (removal of #).
Deploy Microsoft JARVIS on NVIDIA Jetson hardware
We are now ready for deployment of JARVIS on the NVIDIA Jetson device. You may be wondering: how exactly are we going to take advantage of GPU acceleration during the execution of our models, specifically on an NVIDIA Jetson embedded device?
The solution will involve two key strategies.
1. We need to ensure that GPU acceleration is available to the Docker instance running on the Jetson embedded device so that our container image can make use of the onboard GPU hardware. The good news is that the JetPack OS ships with the NVIDIA docker runtime which allows us to make GPU devices available to Docker. The bad news is that it is not enabled by default.
2. To achieve acceleration during execution of the HuggingFace models, we need Jetson-compatible versions of pytorch, torchvision, and torchaudio, as these libraries handle the underlying execution of the GPU-accelerated functions inherent to these models. The bad news is that these libraries aren't readily available in the default Python package channels. The good news is that NVIDIA offers a pre-built container image (NVIDIA L4T PyTorch) that contains Jetson-compatible versions of these packages. By building a custom container based on the NVIDIA L4T PyTorch image with all dependencies for JARVIS pre-installed, we can ship the JARVIS project with everything needed to run the example applications and with accelerated support for HuggingFace models. I will supply a pre-built image that you can simply download and run in the next steps, but if you are curious how it is built, you can refer to the Dockerfile used to build this image.
With these concepts in mind, we are now ready to begin deployment of JARVIS to our Jetson device. To begin, ensure that the Docker runtime is set to default to "nvidia" on your Jetson device. Do this by opening /etc/docker/daemon.json in your favorite text editor and modifying the contents as shown:
{
    "runtimes": {
        "nvidia": {
            "path": "/usr/bin/nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
Once you have saved this modification, restart the docker service with:
sudo systemctl restart docker
Next, execute the following command to start an instance of JARVIS in Docker:
docker run --name jarvis --net=host --gpus all -v ~/src/JARVIS/server/configs:/app/server/configs -v ~/src/JARVIS/server/models:/app/server/models toolboc/nv-jarvis:r35.2.1
This step will automatically start models_server.py with the configuration located at ~/src/JARVIS/server/configs/config.default.yaml.
Note: You may notice that this syntax assumes that you have configs located in ~/src/JARVIS/server/configs on the host file system, and model files located in ~/src/JARVIS/server/models on the host file system. If you followed the instructions up to this point exactly as described, this should not require any modification. The -v parameter will mount these paths into the docker container at runtime.
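While the models server is starting, you can optionally confirm that GPU-accelerated PyTorch is visible inside the running container. This quick check is not part of the JARVIS workflow; it simply assumes the container is named jarvis as in the command above, and can be run by pasting the lines into an interactive interpreter started with docker exec -it jarvis python3:
# Quick sanity check that the Jetson-compatible PyTorch build can see the GPU
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    # Should report the Jetson Orin's integrated GPU
    print("Device:", torch.cuda.get_device_name(0))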
In a second terminal window, start the awesome_chat.py service that will bridge the web service to the models server:
docker exec jarvis python3 awesome_chat.py --config configs/config.default.yaml --mode server
Note: This step works by attaching to the named "jarvis" instance and executing the command inside the running container. If the command fails with "ValueError: The server of local inference endpoints is not running, please start it first. (or using `inference_mode: huggingface` in configs/config.default.yaml for a feature-limited experience)", you may need to allow more time for the models server to finish starting up.
Finally, in a third terminal window start up the web application with:
docker exec jarvis npm run dev --prefix=/app/web
This will start a locally running web server at http://localhost:9999. You should now be able to visit this url in a web browser on the Jetson device and begin interacting with the Multimodal AI chat application.
Here are some ideas to get started with:
- "Given a collection of image A: /examples/a.jpg, B: /examples/b.jpg, C: /examples/c.jpg, please tell me how many zebras in these pictures?"
- "Please generate a canny image based on /examples/f.jpg"
- "show me a joke and an image of cat"
- "what is in /examples/a.jpg"
- "generate a video and audio about a dog is running on the grass"
- "based on the /examples/a.jpg, please generate a video and audio"
- "based on pose of /examples/d.jpg and content of /examples/e.jpg, please show me a new image"
Microsoft JARVIS is fun to use, but you may want to demo this application to others or get better performance when performing AI tasks. Below are a few tricks that you can use to enhance the overall experience.
1. Enable remote access from another networked device to the Multimodal AI chat application. (i.e. access the web app running on the Jetson from a browser on a second laptop)
This can be accomplished by modifying the value of HUGGINGGPT_BASE_URL in JARVIS/web/src/config/index.ts to use the IP address of the Jetson device instead of localhost (keep the existing port unchanged).
2. Text2Video is awesome, but sometimes my device runs out of memory during inference. How can I make it more reliable?
You can reduce the resources used for text2video by modifying this line in JARVIS/server/models_server.py:
video_frames = pipe(prompt, num_inference_steps=50, num_frames=40).frames
Reducing either the num_inference_steps or num_frames parameter will boost performance at the expense of output quality.
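For example, a lighter-weight configuration might look like the following (the specific values are illustrative rather than taken from the project; tune them for your device and quality needs):
video_frames = pipe(prompt, num_inference_steps=25, num_frames=16).frames  # example values only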
You may also find that the text2video process simply doesn't have enough time to finish. In that case, you can try increasing the timeout on this line in JARVIS/web/src/api/hugginggpt.ts:
timeout: 180000, // 180 seconds
3. I want to maximize the available memory and CPU on my Jetson device. To do this, I want to disable the GUI completely and ensure that no programs other than JARVIS are running.
You can accomplish this by initiating system runlevel 3, which disables the desktop environment and frees up the maximum amount of resources on your device while still allowing remote SSH access. To disable the GUI, run the following on the Jetson device:
sudo init 3
To undo this, you can run the following to re-enable the desktop environment:
sudo init 5
Conclusion
Multimodal AI is the unification of various domains of AI to accomplish tasks beyond the capacity of any single model. By orchestrating various inference tasks, we demonstrate how AI models can be chained to create truly intelligent interactions that get us closer to human-like behavior. When we give the latest innovations in LLMs the ability to direct and make sense of vision and audio inference, we begin to see how AI can truly come to life with a Multimodal approach.
While this project demonstrates what is possible, it becomes clear that truly intelligent systems that can think, understand, see, hear, and speak are right around the corner. In fact, we see similarities in the recent GPT-4 release, which reportedly uses a Mixture of Experts approach to solve various tasks with domain-specific models and achieve higher levels of human-like problem solving.
With NVIDIA Jetson embedded hardware, it is remarkable that we can work with similar functionality on a small form-factor, local device. It is my belief that unlocking large-model functionality to direct local execution on these types of devices will drive the next wave of AIoT and the emergence of mainstream AI in daily applications.
Until next time, Happy Hacking!