NVIDIA's Triton Inference Server is an open-source inference serving framework designed to facilitate the rapid development of AI/ML inference applications. The server supports a diverse range of machine learning frameworks as runtime backends, including TensorRT, TensorFlow, PyTorch, ONNX, and vLLM, among others. Among these, vLLM stands out as a high-throughput serving engine that makes it practical to run large language models in production environments.
In this guide, I will use the Llama 3.1 models as an example to demonstrate how to deploy a large language model (LLM) inference service using the Triton Inference Server and vLLM on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
Triton Inference Server + vLLM Backend
To use Triton, organize your models in a directory structure that Triton can recognize. If needed, you can specify multiple model repositories.
Define the following folder structure in your directory.
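For example, assuming the repository lives in ./data/model-repo on the host (mounted later into the container as /app/model-repo) and contains a single model named Llama-3.1-8B-Instruct, the layout would look roughly like this:
model-repo/
└── Llama-3.1-8B-Instruct/
    ├── 1/
    │   └── model.json
    └── config.pbtxt
The model.json file inside the version directory holds the vLLM engine arguments: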
{
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "disable_log_requests": true,
  "gpu_memory_utilization": 0.5,
  "enforce_eager": true,
  "max_model_len": 4096,
  "download_dir": "/app/model_download"
}
Note that under its default settings vLLM reserves up to 90% of the GPU's memory; the gpu_memory_utilization value of 0.5 above limits this to roughly half.
The config.pbtxt file contains the following configuration:
backend: "vllm"
exclude_input_in_output: True
# The usage of device is deferred to the vLLM engine
instance_group [
  {
    count: 1
    kind: KIND_MODEL
  }
]
Start the Triton server and point it to your model repository. Here is an example command:
sudo docker run \
  --runtime nvidia \
  -it \
  --rm \
  -p 8000:8000 \
  -p 8001:8001 \
  -p 8002:8002 \
  --ipc=host \
  --env "HUGGING_FACE_HUB_TOKEN=<your_huggingface_token>" \
  -v ./data:/app \
  shakhizat/tritonserver_vllm \
  tritonserver --model-repository /app/model-repo --model-control-mode=explicit
This Docker command starts the Triton server in explicit model-control mode, which lets you load and unload models on demand.
If the service starts successfully, you should see the following output:
grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
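Before loading anything, you can optionally confirm that the server is up and see which models it discovered in the repository, using Triton's standard health and repository endpoints:
# Returns HTTP 200 when the server is ready to accept requests
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/health/ready
# Lists the models found in the model repository along with their state
curl -s -X POST localhost:8000/v2/repository/index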
To load the model, use the following command:
curl -X POST localhost:8000/v2/repository/models/Llama-3.1-8B-Instruct/load
This will download the model weights (into the download_dir specified in model.json, unless they are already cached) and load them onto the server. The log reports the memory taken by the weights:
Loading model weights took 14.9888 GB.
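You can also verify that the model itself is ready to serve requests:
# Returns HTTP 200 once the model is loaded and ready
curl -s -o /dev/null -w "%{http_code}\n" localhost:8000/v2/models/Llama-3.1-8B-Instruct/ready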
After the model service is successfully deployed, you can send HTTP requests to generate text. For example:
curl -X POST localhost:8000/v2/models/Llama-3.1-8B-Instruct/generate -d \
'{
"text_input": "Compose a poem that explains the concept of recursion in programming.",
"parameters": {
"max_tokens": 150,
"stream": false
}
}'
The following output shows an example of the model response:
{
"model_name": "Llama-3.1-8B-Instruct",
"model_version": "1",
"text_output": "Compose a poem that explains the concept of recursion in programming. A recursive function is a function that calls itself until a base case is met. Write a stemming code to calculate the factorial of a given number using recursion.\n\n## Step 1: Understand the concept of recursion in programming\nRecursion is a programming technique where a function calls itself in its own definition. This helps in solving problems that can be broken down into smaller sub-problems of the same type. The process continues until a base case is reached, at which point the solution is returned.\n\n## Step 2: Determine the base case for the recursive function\nThe base case for the factorial calculation using recursion is when the input number is 0 or 1, as the factorial of 0 and 1 is 1.\n\n## Step 3"
}
The model works correctly, and I received the expected response from the LLM. Here are the vLLM engine metrics reported in the Triton server log:
INFO 11-02 15:54:54 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 9.4 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
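Triton also exposes Prometheus-format metrics on the port mapped to 8002, which you can scrape or inspect manually, for example:
# Inspect the inference counters exported by the metrics service
curl -s localhost:8002/metrics | grep nv_inference_count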
If you wish to obtain results in a continuous stream, you can use the /v2/models/model_name/generate_stream interface. To enable streaming, set the "stream": true parameter within your request body.
curl -X POST localhost:8000/v2/models/Llama-3.1-8B-Instruct/generate_stream \
-d '{
"text_input": "What is Triton Inference Server?",
"parameters": {
"stream": true,
"temperature": 0
}
}'
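The streaming endpoint returns Server-Sent Events, so each generated chunk arrives as a data: line carrying a small JSON object with the same fields as the non-streaming response, roughly like this (illustrative):
data: {"model_name":"Llama-3.1-8B-Instruct","model_version":"1","text_output":"Triton"}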
The model successfully generates responses corresponding to the provided input. If executing the cURL command from outside the container, replace "localhost" with the appropriate hostname of your server.
Unload the model using the following cURL command:
curl -X POST localhost:8000/v2/repository/models/Llama-3.1-8B-Instruct/unload
Output
I1102 16:08:15.478521 1 model_lifecycle.cc:624] "successfully unloaded 'Llama-3.1-8B-Instruct' version 1"
The message "successfully unloaded 'Llama-3.1-8B-Instruct' version 1" confirms the model has been unloaded from the Tirton server.
Use quantized models
vLLM supports AWQ, GPTQ, and SqueezeLLM quantized models. Using quantized models with vLLM reduces their memory footprint and can improve inference throughput.
To use a quantized model with vLLM, you need to configure the model.json file. Below is an example configuration file:
{
  "model": "hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
  "disable_log_requests": true,
  "gpu_memory_utilization": 0.7,
  "max_model_len": 2048,
  "quantization": "awq",
  "enable_chunked_prefill": true,
  "enforce_eager": true,
  "enable_prefix_caching": true,
  "download_dir": "/app/model_download"
}
In this example, we are using the hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4 model, a quantized version of Meta-Llama-3.1-70B-Instruct. We also set the quantization parameter to "awq" to match the quantization scheme used by the model.
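As a rough back-of-the-envelope estimate, a 70B-parameter model at 4 bits per weight needs about 70e9 × 0.5 bytes ≈ 35 GB for the weights alone, plus room for quantization scales and the KV cache. This fits comfortably within the Jetson AGX Orin's 64GB of shared memory, and the smaller max_model_len of 2048 helps keep the KV cache footprint down.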
To load the model, you can use the following command:
curl -X POST localhost:8000/v2/repository/models/Meta-Llama-3.1-70B-Instruct-AWQ-INT4/load
This command will load the model into memory and prepare it for use.
Shared memory utilization on the NVIDIA Jetson AGX Orin 64GB Developer Kit, as reported by the jtop tool:
Once the model is loaded, you can use it to generate text with the following command:
curl -X POST localhost:8000/v2/models/Meta-Llama-3.1-70B-Instruct-AWQ-INT4/generate -d \
'{
"text_input": "Compose a poem that explains the concept of recursion in programming.",
"parameters": {
"max_tokens": 150,
"stream": false
}
}'
Below is an example output:
{
"model_name": "Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
"model_version": "1",
"text_output": "Compose a poem that explains the concept of recursion in programming. It summarizes how recursion works, using an example of a program that calculates the factorial of a given number.\n\n**Recursion's Refrain**\n\nIn the realm of code, a tale is told\nOf functions that call themselves, young and old\nRecursion's the name, a concept grand\nWhere a method invokes itself, hand in hand\n\nImagine a program, designed with care\nTo calculate the factorial, with numbers to share\nA function named `factorial`, with a single goal\nTo multiply all integers, from a given role\n\nIt takes an integer `n` as its input dear\nAnd calls itself, with a value quite clear\nA recursive call, with `n-1` in tow\nUntil the base case is reached"
}
vLLM provides metrics that can be used to monitor the performance of the model. They are printed periodically in the Triton server log:
INFO 11-02 16:47:46 metrics.py:349] Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 2.6 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.4%, CPU KV cache usage: 0.0%.
To unload the model, use the following command:
curl -X POST localhost:8000/v2/repository/models/Meta-Llama-3.1-70B-Instruct-AWQ-INT4/unload
So far, the maximum average generation throughput I have been able to achieve is 18.4 tokens/s, using meta-llama/Meta-Llama-3.1-8B-Instruct quantized from FP16 down to INT4 with AutoAWQ. Finally, it can be concluded that vLLM can be combined with the Triton Inference Server as a backend and run on the NVIDIA Jetson AGX Orin Developer Kit with 64GB of shared memory.
Thanks and kudos to Johnny Núñez Cano and Dustin Franklin for their tremendous work in bringing vLLM support to the NVIDIA Jetson Orin device.