How to Run a ChatGPT-Like LLM on an NVIDIA Jetson Board

This project enables you to unleash the full potential of a ChatGPT-like Large Language Model (LLM) on NVIDIA Jetson boards.

Advanced. Full instructions provided. 5 hours.

Things used in this project

Hardware components

Seeed Studio ReSpeaker Mic Array v2.0

Seeed Studio reComputer Jetson-20-1-H2 with Jetson Xavier NX 16 GB module, Aluminium case, pre-installed JetPack System

Story

Language models have revolutionized the field of natural language processing, enabling computers to understand and generate human-like text. One such powerful language model is ChatGPT, developed by OpenAI. There are a number of AI players in the market right now, including ChatGPT, Google Bard, Bing AI Chat, and many more. However, all of them require an internet connection to interact with the AI. At the same time, there is a growing demand for running similar models on edge devices like single-board computers (SBCs) for offline and low-latency applications.

I was inspired by the groundbreaking work of Nick Bild, who in his Hackster post explored the concept of VoiceGPT, a voice assistant that leverages the capabilities of ChatGPT on a Raspberry Pi. In this post, I will use an NVIDIA Jetson board instead of the Raspberry Pi. The NVIDIA Jetson board, known for its powerful GPU and compact form factor, offers an excellent platform for running sophisticated language models. By running a ChatGPT-like language model on the Jetson board, you can benefit from reduced network latency, increased privacy, and the ability to use the model in resource-constrained environments without relying on an internet connection.

So on that note, let’s go ahead and learn how to use an LLM locally without the internet.

Overview

For a quick overview of the scenario I want to create, check out the picture below. The hardware setup for running a ChatGPT-like Large Language Model (LLM) on an NVIDIA Jetson board consists of the ReSpeaker USB mic array, an SBC such as the NVIDIA Jetson, and a Bluetooth speaker.

Proposed system architecture with the NVIDIA Jetson Xavier NX as the system brain of the voice assistant

With this hardware setup, the microphone array captures audio data from the user, which is then processed by the Jetson board. The Jetson board runs the ChatGPT-like language model, generating a text response to the user's input. The text response is then converted into speech by a text-to-speech system on the Jetson and played through the Bluetooth speaker. The entire operation, from capturing audio input to generating text responses and converting them to speech, is performed locally on the edge. All the necessary computation happens on the NVIDIA Jetson board without requiring an internet connection, which preserves privacy and reduces latency by avoiding the need to send data to a remote server.

In the image below, you can see my demo setup with the NVIDIA Jetson board, the microphone array, and the Bluetooth speaker.

Machine learning pipeline for Voice assistant

The machine learning pipeline for a voice assistant typically involves several steps to convert spoken language into a meaningful response. Here's a software diagram of the pipeline:

Microphone Input: The pipeline begins by capturing audio input from the user through a microphone.
Wake-up Detection: The captured audio is first analyzed to detect whether the user has issued a wake-up command. If the wake-up command is detected, the voice assistant proceeds to the next stage; otherwise, it remains in a standby state.
Automatic Speech Recognition (ASR): Once the wake-up command is detected, the audio data is passed through an Automatic Speech Recognition system. ASR technology converts spoken language into written text, allowing the voice assistant to understand the user's speech.
Large Language Model: The output from the ASR system, which is the recognized text, is then passed to a chatbot large language model.
Text-to-Speech (TTS): After the chatbot language model generates a response, the text-based output is converted into spoken words using a Text-to-Speech system.
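The stages above can be sketched as a single function that chains them together. This is only a minimal illustration, not the code from my repo: the function name run_pipeline_once and its callable parameters are hypothetical placeholders, and each stage is passed in so that any wake-word detector, ASR engine, LLM, or TTS engine could be plugged in.

```python
from typing import Callable, Optional


def run_pipeline_once(
    audio: bytes,
    detect_wake_word: Callable[[bytes], bool],
    transcribe: Callable[[bytes], str],
    generate_reply: Callable[[str], str],
    synthesize: Callable[[str], bytes],
) -> Optional[bytes]:
    """Run one pass through the voice assistant pipeline.

    Returns synthesized speech audio, or None if the assistant
    should stay in standby. All stage callables are hypothetical
    stand-ins for real components.
    """
    # Wake-up detection: stay in standby unless the wake-up command is heard.
    if not detect_wake_word(audio):
        return None

    # ASR: convert the spoken audio into text.
    text = transcribe(audio)
    if not text.strip():
        # Assumption of this sketch: nothing intelligible means no reply.
        return None

    # LLM: generate a text response to the recognized text.
    reply = generate_reply(text)

    # TTS: convert the text response into speech audio.
    return synthesize(reply)
```

In a real assistant this function would be called in a loop on audio chunks coming from the microphone, and the returned audio would be sent to the speaker.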

How to Run a Large Language Model on an NVIDIA Jetson board

In this project, we’ll explore the features and capabilities of the FastChat repository. To deploy a FastChat model on an NVIDIA Jetson Xavier NX board, follow these steps:

Install the FastChat library using the pip package manager. Execute the following command:

pip3 install fschat

Then run the command below. On the first run, it automatically downloads the model weights from Hugging Face:

python3 -m fastchat.serve.cli --model-path lmsys/fastchat-t5-3b-v1.0
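If you prefer to start the chat from a Python script, for example as part of a larger launcher, the same command can be assembled as an argument list. This is a small sketch rather than part of FastChat itself: build_fastchat_cli_command is a hypothetical helper, and it only reproduces the command shown above.

```python
from typing import List, Sequence


def build_fastchat_cli_command(
    model_path: str = "lmsys/fastchat-t5-3b-v1.0",
    extra_args: Sequence[str] = (),
) -> List[str]:
    """Build the argument list for the FastChat CLI command shown above.

    extra_args is passed through unchanged; this helper does not
    validate that the CLI accepts them.
    """
    return [
        "python3", "-m", "fastchat.serve.cli",
        "--model-path", model_path,
        *extra_args,
    ]
```

The resulting list can be passed to subprocess.run(cmd, check=True), which starts the same interactive session as typing the command in a terminal.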

Finally, the test video is shown below.

The code executes on the Jetson board without transferring any data to the cloud. As the video shows, the initial model loading and token generation are slow, so we will now explore ways to improve the performance of your Jetson board.

Overclocking Jetson Xavier NX's CPU and GPU

Overclocking is a way to boost hardware performance by tuning device parameters such as clock frequencies.

Before overclocking

You can enhance the performance of your Jetson Xavier by following the instructions provided in this GitHub repository. The repository contains the necessary information to activate the maximum clock frequencies, which are displayed below.

After overclocking

As we can see, the CPU and GPU frequencies are higher than they were before overclocking. You can also check the current performance of NVIDIA Jetson boards using the jetson-stats utility developed by Raffaello Bonghi.
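To confirm the CPU clocks from a script, you can also read the standard Linux cpufreq files directly. The sketch below assumes the board exposes /sys/devices/system/cpu/cpuN/cpufreq/scaling_cur_freq (values in kHz), which is the usual Linux layout but is an assumption here; the function name read_cpu_freqs_mhz is my own, and GPU clocks are not covered because their sysfs location differs between boards.

```python
from pathlib import Path
from typing import Dict

# Standard Linux cpufreq location; assumed to be present on the Jetson.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_cpu_freqs_mhz(root: Path = CPUFREQ_ROOT) -> Dict[str, float]:
    """Return the current frequency of each CPU core in MHz.

    Cores without a readable scaling_cur_freq file (for example,
    cores that are offline) are skipped.
    """
    freqs: Dict[str, float] = {}
    for freq_file in sorted(root.glob("cpu[0-9]*/cpufreq/scaling_cur_freq")):
        core = freq_file.parent.parent.name  # e.g. "cpu0"
        try:
            khz = int(freq_file.read_text().strip())
        except (OSError, ValueError):
            continue
        freqs[core] = khz / 1000  # kHz -> MHz
    return freqs
```

Calling read_cpu_freqs_mhz() before and after overclocking gives a quick numeric comparison of the per-core clocks.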

Putting It All Together

For this project, I integrated wake-up word detection and automatic speech recognition using the Vosk API. For speech output, I used the Piper text-to-speech engine, served through FastAPI, a web framework for building APIs quickly. When the chatbot generates a response, the text is sent to the TTS module: FastAPI receives the text and Piper synthesizes it into natural-sounding speech. The generated audio is then returned to the user as a response, allowing the system to interact with the user through voice.
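One simple way to do wake-up word detection on top of ASR is to look for the wake phrase in the recognized text. Vosk returns each final result as a JSON string with a "text" field, so the check can be sketched as below. Treat this as an illustration, not the exact code from my repo: extract_command is a hypothetical helper, and the default wake word "jetson" is a placeholder.

```python
import json
from typing import Optional


def extract_command(result_json: str, wake_word: str = "jetson") -> Optional[str]:
    """Return the text spoken after the wake word, or None if it is absent.

    result_json is a Vosk-style result string such as
    '{"text": "jetson what time is it"}'. The default wake word
    is a placeholder for this sketch.
    """
    text = json.loads(result_json).get("text", "")
    words = text.lower().split()
    wake = wake_word.lower().split()
    if not wake:
        return None

    # Look for the wake phrase as a run of whole words.
    n = len(wake)
    for i in range(len(words) - n + 1):
        if words[i:i + n] == wake:
            # Everything after the wake phrase is the user's command.
            return " ".join(words[i + n:])
    return None
```

A returned string, possibly empty if the user said only the wake word, means the assistant should wake up; None means it should stay in standby and keep listening.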

Here is a demonstration video of what the final result looks like.

Overall, this voice assistant understands spoken commands, processes them with a chatbot, and generates spoken responses, providing a conversational experience powered by a ChatGPT-like Large Language Model running on NVIDIA Jetson boards.

Conclusion

That’s it for today! I explored how to set up and run a ChatGPT-like large language model on the NVIDIA Jetson, enabling you to have conversational AI capabilities locally.

All the code referenced in this story is available in my GitHub repo.

I hope you found this post useful and thanks for reading it. If you have any questions or feedback, leave a comment below. Stay tuned!