Large language models (LLMs) have a wide range of applications across various fields thanks to their ability to generate human-like text, but deploying them effectively in real-world scenarios presents unique challenges. These models demand significant computational resources, seamless scalability, and efficient traffic management to meet the demands of production environments. This is where Kubernetes becomes essential. Lightweight Kubernetes distributions such as K3s, K3d, Kind, Minikube, and MicroK8s are becoming increasingly popular for local development; they all serve a similar purpose: running Kubernetes on a single machine.
In this blog post, you’ll learn how to get started with running an LLM on the NVIDIA Jetson AGX Orin developer kit using a K3s Kubernetes cluster. K3s is an official Cloud Native Computing Foundation sandbox project: a lightweight, fully compliant Kubernetes distribution designed for resource-constrained production environments and optimized for ARM64, making it an ideal choice for small-scale and edge deployments like the NVIDIA Jetson AGX Orin developer kit.
The following diagram shows the target infrastructure that you will obtain after following this blog post.
I believe K3s makes it easier to enable GPU passthrough into the cluster, since it only needs to rely on NVIDIA CUDA and Docker. To achieve this, the default Docker runtime configured in /etc/docker/daemon.json requires modification. Installation instructions for Docker with NVIDIA runtime support are available at the NVIDIA Jetson AI Lab.
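For reference, here is a minimal sketch of the typical /etc/docker/daemon.json with the NVIDIA runtime set as the default; follow the NVIDIA Jetson AI Lab instructions for the authoritative steps on your JetPack version:
# Register the NVIDIA container runtime and make it Docker's default, then restart Docker.
sudo tee /etc/docker/daemon.json > /dev/null <<'EOF'
{
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    },
    "default-runtime": "nvidia"
}
EOF
sudo systemctl restart docker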
Before beginning this tutorial, ensure you have a basic knowledge of Kubernetes. Familiarity with kubectl, deployments, services, and pods is a must.
Install K3s with Docker Runtime
Run the following command to install K3s using Docker as the container runtime and make the kubeconfig file readable:
curl -sfL https://get.k3s.io | INSTALL_K3S_EXEC="--docker --write-kubeconfig-mode 644" sh -
And done! With just one line, you have a local K3s cluster running, complete with a load balancer, ingress controller, storage class, CoreDNS, and more. The installer also sets up kubectl automatically on the NVIDIA Jetson AGX Orin.
Once the installation is complete, you can verify the cluster's status by running the following command:
kubectl get nodes
The output should list the cluster's nodes (currently just one). To check the pods, run:
kubectl get pods
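A freshly installed cluster has no pods in the default namespace, so to see the K3s system components (CoreDNS, Traefik, the local-path provisioner, and so on) list pods across all namespaces:
kubectl get pods -A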
To manage Kubernetes applications effectively, you can install Helm, a package manager for Kubernetes. Create a shell script (e.g., install_helm.sh) with the following content:
#!/bin/bash
set -e

helm_exists() {
  which helm > /dev/null 2>&1
}

# Skip installation if Helm is already present
if helm_exists; then
  echo "Helm is installed"
  exit 0
fi

# Install Helm
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh

# Test the Helm installation
if helm_exists; then
  echo "Helm is successfully installed"
else
  echo "Helm installation failed"
  exit 1
fi
Save this script and execute it using the command:
chmod +x install_helm.sh
./install_helm.sh
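You can confirm that Helm is on your PATH and check its version with:
helm version --short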
To verify the Kubernetes environment and confirm GPU access from within a pod, create a YAML file named test.yaml with the following specification:
apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  containers:
  - name: torchtest
    image: dustynv/pytorch:2.6-r36.4.0-cu128-24.04
    command: [ "/bin/bash", "-c", "--" ]
    args: [ "while true; do sleep 30; done;" ]
Apply this configuration to your Kubernetes cluster using the command:
kubectl apply -f test.yaml
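The dustynv/pytorch image is large, so the first pull can take a while; you can watch the pod until it becomes ready:
kubectl get pod torch -w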
Once the pod transitions to the Running status, you can gain interactive shell access to the container to execute Python code with CUDA support:
kubectl exec -it torch -- python3
Then, run the following commands to verify the GPU is recognized:
>>> import torch
>>> torch.cuda.get_device_name(0)
'Orin'
>>>
The output 'Orin' confirms that the PyTorch environment within the container can successfully detect and utilize the NVIDIA Jetson AGX Orin's GPU.
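The same check can also be run non-interactively, and the test pod can be deleted once you are done:
kubectl exec torch -- python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.get_device_name(0))"
kubectl delete pod torch
With GPU access verified, you can move on to deploying vLLM.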
vLLM Production-stack is an open-source reference implementation of an inference stack built on top of vLLM. We’ll create a Kubernetes deployment and service to manage and expose the LLM application.
First of all, we need to create a custom Docker image for the vLLM router specifically designed for the arm64 architecture of the NVIDIA Jetson AGX Orin Developer Kit.
Create a Dockerfile with the following content:
FROM python:3.12-alpine
# Set environment variables
ENV LANG=C.UTF-8
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONUNBUFFERED=1
# Install necessary system dependencies
RUN apk add --no-cache \
    build-base \
    git \
    openssl-dev \
 && pip install --upgrade pip
# Set working directory
WORKDIR /app
# Clone the vLLM production stack repository
RUN git clone https://github.com/vllm-project/production-stack.git .
# Install the router component in editable mode
RUN pip install --no-cache-dir -e .
# Set the entrypoint
ENTRYPOINT ["vllm-router"]
Navigate to the directory containing the Dockerfile and run the following command to build the Docker image:
docker buildx build --platform linux/arm64 -t vllm-router-arm64 .
The resulting Docker image will be tagged as vllm-router-arm64 and can be used to deploy the vLLM router on your NVIDIA Jetson AGX Orin. Finally, push it to a container registry (such as Docker Hub).
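For example, assuming a hypothetical Docker Hub account named your-dockerhub-user (substitute your own):
docker tag vllm-router-arm64 your-dockerhub-user/vllm-router-arm64:latest
docker push your-dockerhub-user/vllm-router-arm64:latest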
Then, clone the vLLM production stack repository:
git clone https://github.com/vllm-project/production-stack.git
Add the vLLM Helm repository:
helm repo add vllm https://vllm-project.github.io/production-stack
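Then refresh your local chart index and confirm that the chart is visible:
helm repo update
helm search repo vllm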
Luckily, a container image of vLLM built for Jetson already exists, so you don’t have to build that one yourself. The images are available on Docker Hub: https://hub.docker.com/r/dustynv/vllm or https://hub.docker.com/repository/docker/johnnync/vllm
When deploying the vLLM production stack using Helm, you'll need to modify the values.yaml file to use your custom ARM64 router image. Locate the routerSpec section in your values file (tutorials/assets/values-02-basic-config.yaml or your custom configuration) and update it as follows:
servingEngineSpec:
  runtimeClassName: ""
  modelSpec:
  - name: "llama3"
    repository: "dustynv/vllm"
    tag: "0.8.3-r36.4.0-cu128-24.04"
    modelURL: "meta-llama/Llama-3.2-1B-Instruct"
    replicaCount: 2
    requestCPU: 2
    requestMemory: "64Mi"
    pvcStorage: "50Gi"
    pvcAccessMode:
      - ReadWriteOnce
    vllmConfig:
      enableChunkedPrefill: false
      enablePrefixCaching: false
      maxModelLen: 4096
      dtype: "bfloat16"
      extraArgs: ["--disable-log-requests", "--gpu-memory-utilization", "0.5"]
    hf_token: YOUR_HF_TOKEN
routerSpec:
  # -- The docker image of the router.
  repository: "shahizat005/vllm-router-arm64"
  tag: "latest"
  imagePullPolicy: "Always"
We are going to deploy an LLM model (meta-llama/Llama-3.2-1B-Instruct) on the Kubernetes cluster using vLLM.
Install the vLLM stack using Helm:
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml
helm install vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
You should see the output indicating the successful deployment of the Helm chart:
NAME: vllm
LAST DEPLOYED: Mon Apr 14 21:47:09 2025
NAMESPACE: default
STATUS: deployed
REVISION: 1
TEST SUITE: None
The deployment with two serving-engine replicas is now running, so you can start testing to see if it works.
Check the status of the pods in your Kubernetes cluster:
kubectl get pods
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-77bc5f9d67-2hqjx 1/1 Running 0 3m29s
vllm-llama3-deployment-vllm-555ff49459-v94h9 1/1 Running 0 3m29s
vllm-llama3-deployment-vllm-555ff49459-vb62h 1/1 Running 0 3m29s
You should see pods for the vLLM router and the serving engine in a running state. You can inspect the logs of the router pod to ensure it's running correctly and serving API requests:
INFO 04-14 22:17:08 [api_server.py:1078] Starting vLLM API server on http://0.0.0.0:8000
INFO 04-14 22:17:08 [launcher.py:26] Available routes are:
INFO 04-14 22:17:08 [launcher.py:34] Route: /openapi.json, Methods: GET, HEAD
INFO 04-14 22:17:08 [launcher.py:34] Route: /docs, Methods: GET, HEAD
INFO 04-14 22:17:08 [launcher.py:34] Route: /docs/oauth2-redirect, Methods: GET, HEAD
INFO 04-14 22:17:08 [launcher.py:34] Route: /redoc, Methods: GET, HEAD
INFO 04-14 22:17:08 [launcher.py:34] Route: /health, Methods: GET
INFO 04-14 22:17:08 [launcher.py:34] Route: /load, Methods: GET
INFO 04-14 22:17:08 [launcher.py:34] Route: /ping, Methods: GET, POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /tokenize, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /detokenize, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/models, Methods: GET
INFO 04-14 22:17:08 [launcher.py:34] Route: /version, Methods: GET
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/chat/completions, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/completions, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/embeddings, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /pooling, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /score, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/score, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/audio/transcriptions, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /rerank, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v1/rerank, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /v2/rerank, Methods: POST
INFO 04-14 22:17:08 [launcher.py:34] Route: /invocations, Methods: POST
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: 10.42.0.1:40232 - "GET /health HTTP/1.1" 200 OK
INFO: 10.42.0.148:38246 - "GET /v1/models HTTP/1.1" 200 OK
INFO: 10.42.0.148:36326 - "GET /metrics HTTP/1.1" 200 OK
INFO 04-14 22:17:19 [loggers.py:87] Engine 000: Avg prompt throughput: 0.0 tokens/s, Avg generation throughput: 0.0 tokens/s, Running: 0 reqs, Waiting: 0 reqs, GPU KV cache usage: 0.0%, Prefix cache hit rate: 0.0%
INFO: 10.42.0.1:40672 - "GET /health HTTP/1.1" 200 OK
To access the vLLM router from outside the Kubernetes cluster, you might need to expose the service using port forwarding or a load balancer. The example below uses port forwarding:
kubectl port-forward svc/vllm-router-service 30080:80
You can now send requests to the vLLM router API.
curl -o- http://localhost:30080/v1/models
You should receive a JSON response listing the available models served by the vLLM stack:
{
"object": "list",
"data": [
{
"id": "meta-llama/Llama-3.2-1B-Instruct",
"object": "model",
"created": 1744669680,
"owned_by": "vllm",
"root": null
}
]
}
After deploying, you can send a test request like this:
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.2-1B-Instruct",
"prompt": "What Is Quantum Computing?",
"max_tokens": 200
}'
The following output shows an example of the model response:
{
"id": "cmpl-f9c1b17aafba48a7b293472a7c8720cf",
"object": "text_completion",
"created": 1744669835,
"model": "meta-llama/Llama-3.2-1B-Instruct",
"choices": [
{
"index": 0,
"text": " and How Does it Work?\nQuantum computing is a new type of computing that uses the principles of quantum mechanics to perform calculations and operations on data. It is based on the idea of using quantum bits or qubits, which can exist in multiple states simultaneously, allowing for exponential speedup over classical computers.\n**How Does Quantum Computing Work?**\n\nQuantum computing works by harnessing the unique properties of quantum mechanics, such as superposition and entanglement, to perform calculations. Here's a simplified overview of how it works:\n\n1. **Quantum Bits (Qubits):** Qubits are the fundamental units of quantum information. They are unique in that they can exist in multiple states simultaneously, known as a superposition. This means that a qubit can represent both 0 and 1 at the same time, unlike classical bits which can only be 0 or 1.\n2. **Superposition:** Qubits can exist in a superposition of states, meaning that they can represent multiple values simultaneously. This allows quantum computers to process multiple possibilities at the same time, which is known as parallel processing.\n3. **Entanglement:** Qubits can also become \"entangled,\" meaning that their properties are connected in a way that can't be explained by classical physics. This allows quantum computers to perform calculations on multiple qubits simultaneously, even if they are separated by large distances.\n4. **Quantum Gates:** Quantum gates are the quantum equivalent of logic gates in classical computing. They are the quantum operations that manipulate qubits, such as addition, subtraction, and rotation.\n5. **Quantum Algorithms:** Quantum algorithms are the specific instructions that a quantum computer follows to perform calculations. These algorithms are designed to take advantage of the unique properties of qubits and superposition.\n\n**Types of Quantum Computing:**\n\nThere are two main types of quantum computing:\n\n1. **Quantum Supremacy:** This is the ability to perform a calculation that is exponentially faster than a classical computer. This is achieved by using multiple qubits to perform the calculation.\n2. **Quantum Adversarial Search:** This is the ability to search for a solution to a problem by simulating all possible solutions. This is achieved by using a quantum computer to search for a solution in a large search space.\n\n**Applications of Quantum Computing:**\n\nQuantum computing has the potential to revolutionize many fields, including:\n\n1. **Cryptography:** Quantum computers can break many classical encryption algorithms, but they",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 6,
"total_tokens": 506,
"completion_tokens": 500,
"prompt_tokens_details": null
}
}
And it works as expected. Now let's check memory consumption using the jtop monitoring solution.
To scale the number of replicas, update replicaCount in values.yaml and redeploy the Helm chart, as shown below. Scaling up can improve performance and availability, but it's crucial to monitor resource consumption and ensure your Kubernetes cluster has sufficient resources to accommodate the increased load.
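A quick sketch of redeploying after editing the values file in place (here tutorials/assets/values-02-basic-config.yaml, as used for the initial install):
helm upgrade vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml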
Example output after setting replicaCount to 3:
kubectl get pods
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-77bc5f9d67-7ghv4 1/1 Running 0 7m4s
vllm-llama3-deployment-vllm-69759fbdc-22r75 1/1 Running 0 7m4s
vllm-llama3-deployment-vllm-69759fbdc-fqlq7 1/1 Running 0 7m4s
vllm-llama3-deployment-vllm-69759fbdc-k7d6c 1/1 Running 0 7m4s
High memory usage can lead to performance degradation or even Out-of-Memory (OOM) errors, causing pods to crash, so keep an eye on memory consumption as you scale.
After updating replicaCount to 4 and redeploying, checking the pods with kubectl get pods might show a pending pod due to insufficient CPU resources:
kubectl get pods
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-77bc5f9d67-rr6v7 1/1 Running 0 2m4s
vllm-llama3-deployment-vllm-69759fbdc-92jh4 0/1 Running 0 2m4s
vllm-llama3-deployment-vllm-69759fbdc-jkgqw 0/1 Running 0 2m4s
vllm-llama3-deployment-vllm-69759fbdc-ld9gb 0/1 Pending 0 2m4s
vllm-llama3-deployment-vllm-69759fbdc-s6srs 0/1 Running 0 2m3s
When you scale the replicaCount in your values.yaml, you are instructing Kubernetes to create more instances (pods) of your vLLM deployment. Each of these pods requires CPU and memory resources.
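To see why a pod is stuck in Pending, inspect its scheduling events; the pod name below is just the example from the output above:
kubectl describe pod vllm-llama3-deployment-vllm-69759fbdc-ld9gb
# The Events section at the bottom explains the Pending state, typically with a
# scheduler message such as "0/1 nodes are available: 1 Insufficient cpu."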
The same stack can also serve the larger Llama 3.1 8B Instruct language model, still leveraging the vLLM inference engine for efficient serving.
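A minimal sketch of the change, assuming you point the existing modelSpec at the larger model in your values file (and adjust memory, storage, and replica settings to fit the Jetson's resources) before redeploying:
# In tutorials/assets/values-02-basic-config.yaml, change the model, e.g.:
#   modelURL: "meta-llama/Llama-3.1-8B-Instruct"
# then apply the updated configuration:
helm upgrade vllm vllm/vllm-stack -f tutorials/assets/values-02-basic-config.yaml
Once the new serving-engine pods are up, check their status: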
kubectl get pods
NAME READY STATUS RESTARTS AGE
vllm-deployment-router-77bc5f9d67-dlk87 1/1 Running 0 8m16s
vllm-llama3-deployment-vllm-85f968795-vx9bw 1/1 Running 0 8m16s
vllm-llama3-deployment-vllm-85f968795-zrjjq 1/1 Running 0 8m16s
Now that the deployment is up and running, let's send a test request to ensure the model is working correctly.
curl -X POST http://localhost:30080/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"prompt": "What Is Quantum Computing?",
"max_tokens": 100
}'
If the request is successful, you will receive a JSON response similar to this:
{
"id": "cmpl-eab4d42d82b2451d8d45876ee93bb838",
"object": "text_completion",
"created": 1744881393,
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [
{
"index": 0,
"text": " And Why Is It A Game Changer?\nQuantum computing is a new paradigm for computing that uses the principles of quantum mechanics to perform calculations that are exponentially faster than classical computers. It is a game-changer because it has the potential to solve complex problems that are currently unsolvable or require an unfeasible amount of time to solve using classical computers.\nClassical computers use bits to store and process information, which can only be in one of two states: 0 or 1. Quantum computers",
"logprobs": null,
"finish_reason": "length",
"stop_reason": null,
"prompt_logprobs": null
}
],
"usage": {
"prompt_tokens": 6,
"total_tokens": 106,
"completion_tokens": 100,
"prompt_tokens_details": null
}
}
You can now verify memory consumption using monitoring tools such as jtop.
Congratulations! You have successfully deployed the Llama 3.1 8B Instruct language model with two replicas on your Kubernetes cluster using vLLM. This setup provides a scalable and potentially more resilient inference service.
To uninstall the stack, run:
helm uninstall vllm
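Note that persistent volume claims created for model storage may not be removed automatically when the release is uninstalled; you can check for leftovers and delete anything you no longer need:
kubectl get pvc
# kubectl delete pvc <pvc-name>   # remove a leftover claim if desired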
Monitoring and Logging using Grafana and Prometheus
Monitoring and logging are critical for maintaining and troubleshooting LLM applications. Prometheus is an open-source monitoring and alerting tool that collects and stores time-series data, while Grafana is a popular data visualization platform that allows you to create interactive dashboards and visualizations.
Details can be found here: https://github.com/vllm-project/production-stack/tree/main/observability
Navigate to the observability folder. Then run the installation script:
sudo bash install.sh
Installation output:
"prometheus-community" already exists with the same configuration, skipping
Release "kube-prom-stack" does not exist. Installing it now.
NAME: kube-prom-stack
LAST DEPLOYED: Mon Apr 14 22:58:27 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
NOTES:
kube-prometheus-stack has been installed. Check its status by running:
kubectl --namespace monitoring get pods -l "release=kube-prom-stack"
Get Grafana 'admin' user password by running:
kubectl --namespace monitoring get secrets kube-prom-stack-grafana -o jsonpath="{.data.admin-password}" | base64 -d ; echo
Access Grafana local instance:
export POD_NAME=$(kubectl --namespace monitoring get pod -l "app.kubernetes.io/name=grafana,app.kubernetes.io/instance=kube-prom-stack" -oname)
kubectl --namespace monitoring port-forward $POD_NAME 3000
Visit https://github.com/prometheus-operator/kube-prometheus for instructions on how to create & configure Alertmanager and Prometheus instances using the Operator.
NAME: prometheus-adapter
LAST DEPLOYED: Mon Apr 14 22:59:14 2025
NAMESPACE: monitoring
STATUS: deployed
REVISION: 1
TEST SUITE: None
NOTES:
prometheus-adapter has been deployed.
In a few minutes you should be able to list metrics using the following command(s):
kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1
The kube-prometheus-stack has been successfully installed. Forward the Grafana dashboard service to a local port:
kubectl --namespace monitoring port-forward svc/kube-prom-stack-grafana 3000:80 --address 0.0.0.0
To forward the Prometheus dashboard, run:
kubectl --namespace monitoring port-forward prometheus-kube-prom-stack-kube-prome-prometheus-0 9090:9090
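With the port-forward active, you can also sanity-check Prometheus through its HTTP API; the built-in up metric reports which scrape targets are healthy:
curl -s 'http://localhost:9090/api/v1/query?query=up'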
Now you can browse the Grafana dashboard. Visit http://<IP_Address>:3000 to view it.
By combining these tools, you gain valuable insight into your Kubernetes cluster’s performance and health, making it easier to identify and troubleshoot issues. Together with Kubernetes features such as autoscaling and service discovery, your setup is built to handle real-world demands effectively.
Thank you for reading this tutorial on how to install K3s with the vLLM inference engine on the NVIDIA Jetson AGX Orin developer kit. I hope you found it useful!
Thanks and kudos to Johnny Núñez Cano and Dustin Franklin for their tremendous work in bringing vLLM support to the Nvidia Jetson Orin device. Special thanks also to AastaLLL and Whitesscott from the NVIDIA forums for their assistance with Kubernetes.