The key bottleneck in large language model (LLM) inference is the shortage of GPU memory. Consequently, most acceleration frameworks focus on reducing peak GPU memory usage and improving GPU utilization. This is where TensorRT-LLM, developed by NVIDIA, comes into play. TensorRT-LLM is an open-source library that accelerates and optimizes LLM inference on the NVIDIA AI platform, building on TensorRT's highly specialized optimizations for inference on NVIDIA GPUs. See the GitHub repo for more examples and documentation.
In this guide, I will use the Llama-3.1-8B-Instruct model as an example to demonstrate how to deploy an LLM inference engine with TensorRT-LLM on the NVIDIA Jetson AGX Orin 64GB Developer Kit. The Jetson AGX Orin Developer Kit features a unified memory architecture: 64 GB of memory shared between the Arm-based CPU cores and the NVIDIA Ampere architecture GPU.
NVIDIA JetPack 6.1 is the latest production release of JetPack 6. Ensure your Jetson AGX Orin Developer Kit has been flashed with JetPack 6.1.
Check the installed JetPack version with the apt show nvidia-jetpack command:
Package: nvidia-jetpack
Version: 6.1+b123
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.1)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.1+b123), nvidia-jetpack-dev (= 6.1+b123)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.4/main arm64 Packages
Description: NVIDIA Jetpack Meta Package
A pre-built Docker image is also available, so you can get started quickly by following the documentation on the NVIDIA Jetson AI Lab page: https://www.jetson-ai-lab.com/tensorrt_llm.html
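If you prefer the container route, the Jetson AI Lab pages generally rely on the jetson-containers tooling; a minimal sketch is shown below, where the package name is an assumption, so check the linked page for the exact command:
jetson-containers run $(autotag tensorrt_llm)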
To free up as much memory as possible for the GPU, I disabled the desktop GUI on the Jetson AGX Orin.
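If you want to do the same, one common way on Ubuntu-based JetPack is to switch the default boot target to console mode (you can revert to graphical.target later):
sudo systemctl set-default multi-user.target
sudo reboot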
Step 1: Get the model weights
We need to download the weights of the model we will be working with, Meta-Llama-3.1-8B-Instruct. Make sure you have accepted the model's license terms and generated a Hugging Face access token so you can download it.
To begin, install the Git LFS package by running the following command in your terminal:
sudo apt-get update && sudo apt-get -y install git-lfs
Then initialize Git LFS:
git lfs install
Clone the Llama-3.1-8B model repository using the following command:
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
Alternatively, you can use the Hugging Face CLI to download the model. Install the CLI using the following command:
pip install -U "huggingface_hub[cli]"
Use the huggingface-cli login command to authenticate your Hugging Face account. Enter your Hugging Face access token when prompted.
Download the Llama-3.1-8B-Instruct model using the following command:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./Llama-3.1-8B-Instruct
Once the download is complete, you can verify the content of the downloaded folder using the following command:
ls -l ./Llama-3.1-8B-Instruct/
total 15693184
-rw-rw-r-- 1 jetson jetson 826 Nov 16 21:22 config.json
-rw-rw-r-- 1 jetson jetson 185 Nov 16 21:22 generation_config.json
-rw-rw-r-- 1 jetson jetson 7627 Nov 16 21:21 LICENSE
-rw-rw-r-- 1 jetson jetson 4976698672 Nov 16 21:30 model-00001-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 4999802720 Nov 16 21:30 model-00002-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 4915916176 Nov 16 21:29 model-00003-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 1168138808 Nov 16 21:24 model-00004-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 23950 Nov 16 21:22 model.safetensors.index.json
drwxrwxr-x 2 jetson jetson 4096 Nov 16 21:42 original
-rw-rw-r-- 1 jetson jetson 40883 Nov 16 21:21 README.md
-rw-rw-r-- 1 jetson jetson 73 Nov 16 21:22 special_tokens_map.json
-rw-rw-r-- 1 jetson jetson 50500 Nov 16 21:22 tokenizer_config.json
-rw-rw-r-- 1 jetson jetson 9085658 Nov 16 21:22 tokenizer.json
-rw-rw-r-- 1 jetson jetson 4691 Nov 16 21:22 USE_POLICY.md
Step 2: Preparation
Create a new virtual environment using the venv module to isolate your project dependencies.
python3 -m venv tensorrt-llm
Activate the newly created virtual environment:
source tensorrt-llm/bin/activate
Update the package lists and install the required packages:
sudo apt-get update
sudo apt-get install -y python3-pip libopenblas-dev ccache
Download the cuSPARSELt installation script:
wget https://raw.githubusercontent.com/pytorch/pytorch/9b424aac1d70f360479dd919d6b7933b5a9181ac/.ci/docker/common/install_cusparselt.sh
Set the CUDA version to 12.6 by running the following command:
export CUDA_VERSION=12.6
Install cuSPARSELt by running the installation script:
sudo -E bash ./install_cusparselt.sh
This process may take some time to complete. Ensure that the installation finishes successfully.
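As a quick sanity check (install locations can vary), you can confirm that the cuSPARSELt libraries are present:
sudo find /usr -name "libcusparseLt*" 2>/dev/null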
Finally, install NumPy version 1.26.1 using pip:
python3 -m pip install numpy=='1.26.1'
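You can confirm that the pinned version is active inside the virtual environment; the command below should print 1.26.1:
python3 -c "import numpy; print(numpy.__version__)"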
This step completes the preparation phase, setting up the environment and dependencies required for the next steps.
Step 3: Build the TensorRT-LLM engine
Clone the TensorRT-LLM repository from GitHub, check out the Jetson release tag, and pull the LFS objects using the following commands:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.12.0-jetson
git lfs pull
Then, run the following command to build a wheel file for TensorRT-LLM:
sudo python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc
Building the wheel can take a considerable amount of time to complete.
You will see an output like the following:
Successfully built tensorrt_llm-0.12.0-cp310-cp310-linux_aarch64.whl
Install the built wheel file using pip:
pip3 install build/tensorrt_llm-*.whl
Expected Output:
Successfully installed tensorrt-llm-0.12.0
Verify the installation by importing the library and printing its version:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
To run the model efficiently on the GPU, we must convert it into the TensorRT-LLM checkpoint format. We will then use the trtllm-build command-line tool to build the optimized TensorRT engine from the converted checkpoint.
The conversion of the HuggingFace model can be done with a single command:
sudo python /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir /path/to/Llama-3.1-8B-Instruct-convert \
--dtype float16
You will see an output like the following:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
230it [00:01, 124.05it/s]
Total time of converting checkpoints: 00:00:30
This should produce two files: a model configuration (config.json) and weights (rank0.safetensors). Next, we build the model engine:
sudo trtllm-build \
--checkpoint_dir /path/to/Llama-3.1-8B-Instruct-convert \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--output_dir /path/to/Llama-3.1-8B-Instruct-engine
This command builds the TensorRT engine using the converted checkpoints and stores the result in the specified directory.
If the build is successful, you should see output like the following:
[11/17/2024-16:26:23] [TRT-LLM] [I] Build phase peak memory: 32793.48 MB, children: 0.00 MB
[11/17/2024-16:26:23] [TRT-LLM] [I] Serializing engine to /home/jetson/Projects/tensorrtllm/Llama-3.1-8B-final/rank0.engine...
[11/17/2024-16:26:44] [TRT-LLM] [I] Engine serialized. Total time: 00:00:20
[11/17/2024-16:26:45] [TRT-LLM] [I] Total time of building all engines: 00:01:06
This should produce two files in the output directory: a model configuration (config.json) and the serialized engine (rank0.engine).
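You can verify that both files were written by listing the engine directory:
ls -l /path/to/Llama-3.1-8B-Instruct-engine/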
Step 4: Run inference on the NVIDIA Jetson AGX Orin 64GB Developer Kit
Once the model engine is built, you can test it by running the model with the following command:
sudo python3 /path/to/TensorRT-LLM/examples/run.py \
--engine_dir /path/to/Llama-3.1-8B-Instruct-engine \
--max_output_len 100 \
--max_attention_window_size 1024 \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--input_text "Kazakhstan is" \
--gpu_weights_percent 70 \
--kv_cache_free_gpu_memory_fraction 0.1 \
--num_beams 1
If the model runs successfully, it generates a completion for the input prompt. The performance of TensorRT-LLM is especially noticeable when the tokens are streamed.
We can then run the TensorRT-LLM server in OpenAI-compatible mode. Run the following command:
sudo python3 /path/to/TensorRT-LLM/examples/apps/openai_server.py \
/path/to/Llama-3.1-8B-Instruct-engine \
--tokenizer /path/to/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 5001
The --host 0.0.0.0 option allows external connections.
You can test the model's inference by sending a request using the curl command.
curl http://localhost:5001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
If the request succeeds, you should see output like the following:
{"id":"chatcmpl-869061ee5db04f8ca9f4d0b870c7de51","object":"chat.completion","created":1732093982,"model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"message":{"role":"assistant","content":"The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay","tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":52,"total_tokens":68,"completion_tokens":16}}
You can also call the TensorRT-LLM service with the OpenAI Python client, just as you would the OpenAI API.
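If the openai Python package is not already installed in your virtual environment, install it first:
pip install openai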
from openai import OpenAI

# Point the client at the local TensorRT-LLM server; the API key is not checked
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5001/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Send a chat completion request to the locally served model
chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Kazakhstan."},
    ],
)
print("Chat response:", chat_response)
If you receive a response like the following, the deployment is successful.
Chat response: ChatCompletion(id='chatcmpl-b71842ec0407465b9b5ac32130bfd356', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='Kazakhstan is a country located in Central Asia, bordered by Russia to the', refusal=None, role='assistant', function_call=None, tool_calls=[]), stop_reason=None)], created=1732086771, model='meta-llama/Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=48, total_tokens=64, completion_tokens_details=None, prompt_tokens_details=None))
Create a Streamlit Web App to Interact with the TensorRT-LLM Service
import streamlit as st
from openai import OpenAI

st.title("TensorRT-LLM Demo on the NVIDIA Jetson AGX Orin Developer Kit")

# Connect to the local OpenAI-compatible TensorRT-LLM server
client = OpenAI(base_url="http://localhost:5001/v1", api_key="None")

# Keep the chat history in the Streamlit session state
if "messages" not in st.session_state:
    st.session_state["messages"] = []

prompt = st.chat_input("Say something")
if prompt:
    st.session_state["messages"].append({"role": "user", "content": prompt})

    # Replay the conversation so far
    for message in st.session_state["messages"]:
        st.chat_message(message["role"]).write(message["content"])

    # Stream the assistant's reply token by token
    container = st.empty()
    chat_completion = client.chat.completions.create(
        stream=True,
        messages=st.session_state["messages"],
        model="ensemble",
        max_tokens=512,
    )
    response = ""
    for event in chat_completion:
        content = event.choices[0].delta.content
        if content:
            response += content
            container.chat_message("assistant").write(response)

    st.session_state["messages"].append({"role": "assistant", "content": response})
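To try the app, save the snippet to a file (here assumed to be app.py), install the dependencies, and launch Streamlit:
pip install streamlit openai
streamlit run app.py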
Demo video:
In this blog post, my goal was to demonstrate how state-of-the-art inference can be achieved with TensorRT-LLM on the NVIDIA Jetson AGX Orin 64GB Developer Kit. I covered everything from compiling an LLM into a TensorRT engine to serving the model and building a Streamlit front end.