The key bottleneck in large language model (LLM) inference is the shortage of GPU memory. Consequently, most acceleration frameworks focus on reducing peak GPU memory usage and improving GPU utilization. This is where TensorRT-LLM, developed by NVIDIA, comes into play. TensorRT-LLM is an open-source library that accelerates and optimizes LLM inference on the NVIDIA AI platform, building on TensorRT's highly specialized optimizations for inference on NVIDIA GPUs. See the GitHub repo for more examples and documentation.
In this guide, I will use the Llama-3.1-8B-Instruct model as an example to demonstrate how to deploy an LLM inference engine with TensorRT-LLM on the NVIDIA Jetson AGX Orin 64GB Developer Kit. The Jetson AGX Orin Developer Kit features a unified memory architecture: 64 GB of memory shared between the Arm-based CPU cores and the NVIDIA Ampere architecture GPU.
NVIDIA JetPack 6.1 is the latest production release of JetPack 6. Ensure your Jetson AGX Orin Developer Kit has been flashed with JetPack 6.1.
Check the installed JetPack version with the apt show nvidia-jetpack command:
Package: nvidia-jetpack
Version: 6.1+b123
Priority: standard
Section: metapackages
Source: nvidia-jetpack (6.1)
Maintainer: NVIDIA Corporation
Installed-Size: 199 kB
Depends: nvidia-jetpack-runtime (= 6.1+b123), nvidia-jetpack-dev (= 6.1+b123)
Homepage: http://developer.nvidia.com/jetson
Download-Size: 29.3 kB
APT-Sources: https://repo.download.nvidia.com/jetson/common r36.4/main arm64 Packages
Description: NVIDIA Jetpack Meta Package
A pre-built Docker image is also available, so you can get started quickly by following the documentation on the NVIDIA Jetson AI Lab page: https://www.jetson-ai-lab.com/tensorrt_llm.html
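If you prefer the container route, the Jetson AI Lab pages generally rely on the jetson-containers tooling; a minimal sketch is shown below, where the package name is an assumption, so check the linked page for the exact command:
jetson-containers run $(autotag tensorrt_llm)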
To free up as much memory as possible for the GPU, I disabled the desktop GUI on the Jetson AGX Orin.
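If you want to do the same, one common way on Ubuntu-based JetPack is to switch the default boot target to console mode (you can revert to graphical.target later):
sudo systemctl set-default multi-user.target
sudo reboot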
Step 1: Get the model weights
We need to download the weights of the model we will be working with, Meta-Llama-3.1-8B-Instruct. Make sure you have accepted the model's license terms and generated a Hugging Face access token so you can download it.
To begin, install the Git LFS package by running the following command in your terminal:
sudo apt-get update && sudo apt-get -y install git-lfs
Then initialize Git LFS:
git lfs install
Clone the Llama-3.1-8B model repository using the following command:
git clone https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct
Alternatively, you can use the Hugging Face CLI to download the model. Install the CLI using the following command:
pip install -U "huggingface_hub[cli]"
Use the huggingface-cli login command to authenticate your Hugging Face account. Enter your Hugging Face access token when prompted.
Download the Llama-3.1-8B-Instruct model using the following command:
huggingface-cli download meta-llama/Llama-3.1-8B-Instruct --local-dir ./Llama-3.1-8B-Instruct
Once the download is complete, you can verify the content of the downloaded folder using the following command:
ls -l ./Llama-3.1-8B-Instruct/
total 15693184
-rw-rw-r-- 1 jetson jetson 826 Nov 16 21:22 config.json
-rw-rw-r-- 1 jetson jetson 185 Nov 16 21:22 generation_config.json
-rw-rw-r-- 1 jetson jetson 7627 Nov 16 21:21 LICENSE
-rw-rw-r-- 1 jetson jetson 4976698672 Nov 16 21:30 model-00001-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 4999802720 Nov 16 21:30 model-00002-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 4915916176 Nov 16 21:29 model-00003-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 1168138808 Nov 16 21:24 model-00004-of-00004.safetensors
-rw-rw-r-- 1 jetson jetson 23950 Nov 16 21:22 model.safetensors.index.json
drwxrwxr-x 2 jetson jetson 4096 Nov 16 21:42 original
-rw-rw-r-- 1 jetson jetson 40883 Nov 16 21:21 README.md
-rw-rw-r-- 1 jetson jetson 73 Nov 16 21:22 special_tokens_map.json
-rw-rw-r-- 1 jetson jetson 50500 Nov 16 21:22 tokenizer_config.json
-rw-rw-r-- 1 jetson jetson 9085658 Nov 16 21:22 tokenizer.json
-rw-rw-r-- 1 jetson jetson 4691 Nov 16 21:22 USE_POLICY.md
Step 2: Preparation
Create a new virtual environment using the venv module to isolate your project dependencies.
python3 -m venv tensorrt-llm
Activate the newly created virtual environment:
source tensorrt-llm/bin/activate
Update the package lists and install the required packages:
sudo apt-get update
sudo apt-get install -y python3-pip libopenblas-dev ccache
Download the cuSPARSELt installation script:
wget https://raw.githubusercontent.com/pytorch/pytorch/9b424aac1d70f360479dd919d6b7933b5a9181ac/.ci/docker/common/install_cusparselt.sh
Set the CUDA version to 12.6 by running the following command:
export CUDA_VERSION=12.6
Install cuSPARSELt by running the installation script:
sudo -E bash ./install_cusparselt.sh
This process may take some time to complete. Ensure that the installation finishes successfully.
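As a quick sanity check (install locations can vary), you can confirm that the cuSPARSELt libraries are present:
sudo find /usr -name "libcusparseLt*" 2>/dev/null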
Finally, install NumPy version 1.26.1 using pip:
python3 -m pip install numpy=='1.26.1'
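You can confirm that the pinned version is active inside the virtual environment; the command below should print 1.26.1:
python3 -c "import numpy; print(numpy.__version__)"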
This step completes the preparation phase, setting up the environment and dependencies required for the next steps.
Step 3: Build the TensorRT-LLM engine
Clone the TensorRT-LLM repository from GitHub, check out the Jetson release tag, and pull the LFS objects using the following commands:
git clone https://github.com/NVIDIA/TensorRT-LLM.git
cd TensorRT-LLM
git checkout v0.12.0-jetson
git lfs pull
Then, run the following command to build a wheel file for TensorRT-LLM:
sudo python3 scripts/build_wheel.py --clean --cuda_architectures 87 -DENABLE_MULTI_DEVICE=0 --build_type Release --benchmarks --use_ccache -DCMAKE_CUDA_COMPILER=/usr/local/cuda-12.6/bin/nvcc
Building the wheel can take a considerable amount of time to complete.
You will see an output like the following:
Successfully built tensorrt_llm-0.12.0-cp310-cp310-linux_aarch64.whl
Install the built wheel file using pip:
pip3 install build/tensorrt_llm-*.whl
Expected Output:
Successfully installed tensorrt-llm-0.12.0
Verify the installation by importing the library and printing its version:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
To run the model efficiently on the GPU, we must convert it into the TensorRT-LLM checkpoint format. We will then use the trtllm-build command-line tool to build the optimized TensorRT engine from the converted checkpoint.
The conversion of the HuggingFace model can be done with a single command:
sudo python /path/to/TensorRT-LLM/examples/llama/convert_checkpoint.py \
--model_dir /path/to/Llama-3.1-8B-Instruct \
--output_dir /path/to/Llama-3.1-8B-Instruct-convert \
--dtype float16
You will see an output like the following:
[TensorRT-LLM] TensorRT-LLM version: 0.12.0
0.12.0
230it [00:01, 124.05it/s]
Total time of converting checkpoints: 00:00:30
This should produce two files: a model configuration (config.json) and weights (rank0.safetensors). Next, we build the model engine:
sudo trtllm-build \
--checkpoint_dir /path/to/Llama-3.1-8B-Instruct-convert \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--output_dir /path/to/Llama-3.1-8B-Instruct-engine
This command builds the TensorRT engine using the converted checkpoints and stores the result in the specified directory.
If the build is successful, you should see output like the following:
[11/17/2024-16:26:23] [TRT-LLM] [I] Build phase peak memory: 32793.48 MB, children: 0.00 MB
[11/17/2024-16:26:23] [TRT-LLM] [I] Serializing engine to /home/jetson/Projects/tensorrtllm/Llama-3.1-8B-final/rank0.engine...
[11/17/2024-16:26:44] [TRT-LLM] [I] Engine serialized. Total time: 00:00:20
[11/17/2024-16:26:45] [TRT-LLM] [I] Total time of building all engines: 00:01:06
This should produce two files in the output directory: a model configuration (config.json) and the serialized engine (rank0.engine).
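You can verify that both files were written by listing the engine directory:
ls -l /path/to/Llama-3.1-8B-Instruct-engine/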
Step 4: Run inference on the NVIDIA Jetson AGX Orin 64GB Developer Kit
Once the model engine is built, you can test it by running the model with the following command:
sudo python3 /path/to/TensorRT-LLM/examples/run.py \
--engine_dir /path/to/Llama-3.1-8B-Instruct-engine \
--max_output_len 100 \
--max_attention_window_size 1024 \
--tokenizer_dir /path/to/Llama-3.1-8B-Instruct \
--input_text "Kazakhstan is" \
--gpu_weights_percent 70 \
--kv_cache_free_gpu_memory_fraction 0.1 \
--num_beams 1
If the model runs successfully, it generates a completion for the input prompt. The performance of TensorRT-LLM is especially noticeable when the tokens are streamed.
We can then run the TensorRT-LLM server in OpenAI-compatible mode. Run the following command:
sudo python3 /path/to/TensorRT-LLM/examples/apps/openai_server.py \
/path/to/Llama-3.1-8B-Instruct-engine \
--tokenizer /path/to/Llama-3.1-8B-Instruct \
--host 0.0.0.0 \
--port 5001
The --host 0.0.0.0 option allows external connections.
You can test the model's inference by sending a request using the curl command.
curl http://localhost:5001/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
If the request succeeds, you should see output like the following:
{"id":"chatcmpl-869061ee5db04f8ca9f4d0b870c7de51","object":"chat.completion","created":1732093982,"model":"meta-llama/Llama-3.1-8B","choices":[{"index":0,"message":{"role":"assistant","content":"The Los Angeles Dodgers won the 2020 World Series, defeating the Tampa Bay","tool_calls":[]},"logprobs":null,"finish_reason":null,"stop_reason":null}],"usage":{"prompt_tokens":52,"total_tokens":68,"completion_tokens":16}}
You can also call the TensorRT-LLM service with the OpenAI Python client, just as you would the OpenAI API.
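If the openai Python package is not already installed in your virtual environment, install it first:
pip install openai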
from openai import OpenAI

# Point the client at the local TensorRT-LLM server; the API key is not checked
openai_api_key = "EMPTY"
openai_api_base = "http://localhost:5001/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Send a chat completion request to the locally served model
chat_response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Tell me about Kazakhstan."},
    ],
)
print("Chat response:", chat_response)
If you receive a response like the following, the deployment is successful.
Chat response: ChatCompletion(id='chatcmpl-b71842ec0407465b9b5ac32130bfd356', choices=[Choice(finish_reason=None, index=0, logprobs=None, message=ChatCompletionMessage(content='Kazakhstan is a country located in Central Asia, bordered by Russia to the', refusal=None, role='assistant', function_call=None, tool_calls=[]), stop_reason=None)], created=1732086771, model='meta-llama/Llama-3.1-8B-Instruct', object='chat.completion', service_tier=None, system_fingerprint=None, usage=CompletionUsage(completion_tokens=16, prompt_tokens=48, total_tokens=64, completion_tokens_details=None, prompt_tokens_details=None))
Create a Streamlit Web App to Interact with the TensorRT-LLM Service
import streamlit as st
from openai import OpenAI

st.title("TensorRT-LLM Demo on the NVIDIA Jetson AGX Orin Developer Kit")

# Connect to the local OpenAI-compatible TensorRT-LLM server
client = OpenAI(base_url="http://localhost:5001/v1", api_key="None")

# Keep the chat history in the Streamlit session state
if "messages" not in st.session_state:
    st.session_state["messages"] = []

prompt = st.chat_input("Say something")
if prompt:
    st.session_state["messages"].append({"role": "user", "content": prompt})

    # Replay the conversation so far
    for message in st.session_state["messages"]:
        st.chat_message(message["role"]).write(message["content"])

    # Stream the assistant's reply token by token
    container = st.empty()
    chat_completion = client.chat.completions.create(
        stream=True,
        messages=st.session_state["messages"],
        model="ensemble",
        max_tokens=512,
    )
    response = ""
    for event in chat_completion:
        content = event.choices[0].delta.content
        if content:
            response += content
            container.chat_message("assistant").write(response)

    st.session_state["messages"].append({"role": "assistant", "content": response})
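To try the app, save the snippet to a file (here assumed to be app.py), install the dependencies, and launch Streamlit:
pip install streamlit openai
streamlit run app.py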
Demo video:
In this blog post, my goal was to demonstrate how state-of-the-art inference can be achieved with TensorRT-LLM on the NVIDIA Jetson AGX Orin 64GB Developer Kit. I covered everything from compiling an LLM into a TensorRT engine to serving the model and building a Streamlit front end.