Triton Inference Server is an open-source platform designed to streamline the deployment and execution of machine learning models. This powerful tool enables you to deploy models from various deep learning frameworks, including TensorRT, TensorFlow, PyTorch, and ONNX, on a wide range of hardware.
Kubernetes has become the dominant container orchestration platform, but running a full Kubernetes cluster can be resource-intensive. Lightweight options like K3s, Minikube, and MicroK8s allow users to easily create single-node clusters locally for development and testing. In this project, I'll leverage the lightweight Kubernetes distribution K3s to deploy Triton Inference Server on a single-node cluster hosted on the NVIDIA Jetson AGX Orin 64GB Developer Kit. K3s is developed by Rancher Labs and optimized for IoT and edge computing scenarios, making it an ideal choice for our deployment. To efficiently store and manage our AI models, I'll use MinIO, a high-performance object storage solution. MinIO will serve as our model repository, allowing us to easily access and deploy different models as needed.
The following diagram shows the target infrastructure that you will obtain after following this blog post.
In this project, I'll focus on deploying Ultralytics YOLOv8 in TensorRT format, a cutting-edge object detection model. YOLOv8 represents the latest advancement in the YOLO series, renowned for its real-time performance and accuracy.
Set up K3s on the NVIDIA Jetson AGX Orin 64GB Developer Kit
Run the following command to download and install K3s on the NVIDIA Jetson AGX Orin 64GB Developer Kit:
curl -sfL https://get.k3s.io | sh
This command downloads and installs the latest stable version of K3s. The installation process will automatically start and run a single-node K3s cluster.
[INFO] Finding release for channel stable
[INFO] Using v1.30.3+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Check the status of the K3s service to ensure it's running:
systemctl status k3s
This command will display the status of the K3s service. You should see output similar to the following:
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2024-08-17 11:55:41 +05; 1min 29s ago
Docs: https://k3s.io
Process: 885450 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
Process: 885469 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 885478 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 885488 (k3s-server)
Tasks: 175
Memory: 1.4G
CPU: 43.904s
CGroup: /system.slice/k3s.service
If you're using a Jetson device with an NVIDIA GPU, K3s should automatically detect the NVIDIA container runtime. You can verify this by checking the /var/lib/rancher/k3s/agent/etc/containerd/config.toml file:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
To ensure the NVIDIA runtime is used by default and avoid potential issues, follow these steps:
cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
Edit the config.toml.tmpl file and add the following lines under the [plugins."io.containerd.grpc.v1.cri".containerd] section:
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
default_runtime_name = "nvidia"
Restart the K3s service so that its embedded containerd regenerates config.toml from the template and picks up the change:
sudo systemctl restart k3s
The K3s installation script installs the kubectl binary automatically for you. Use kubectl get nodes to view the available nodes in your cluster:
sudo kubectl get nodes
This command will display the nodes in the cluster. You should see output similar to the following:
NAME STATUS ROLES AGE VERSION
ubuntu Ready control-plane,master 5m51s v1.30.3+k3s1
You can also use the following command to display more detailed information about the cluster:
sudo kubectl cluster-info
You should see output similar to the following:
Kubernetes control plane is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
To deploy a test application, create a YAML file called test.yaml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  imagePullSecrets:
    - name: my-image-pull-secret
  containers:
    - name: torchtest
      image: dustynv/l4t-pytorch:r36.2.0
      securityContext:
        privileged: true
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
Deploy the application using the following command:
sudo kubectl apply -f test.yaml
Verify the pod is running with sudo kubectl get pods:
NAME READY STATUS RESTARTS AGE
torch 1/1 Running 0 111s
Once the pod is in the Running state, we can open a Python interpreter inside it:
sudo kubectl exec -it torch -- python3
You can now interact with the Python terminal and verify that the NVIDIA GPU is available:
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_device_name()
'Orin'
>>> torch.cuda.is_available()
True
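As an additional check, you can run a small tensor operation on the GPU from the same container; a minimal sketch (the matrix sizes are arbitrary):
import torch

# Allocate two tensors directly on the GPU and multiply them
a = torch.rand(1024, 1024, device="cuda")
b = torch.rand(1024, 1024, device="cuda")
c = a @ b
print(c.device, c.shape)  # expected: cuda:0 torch.Size([1024, 1024])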
To interact with the K3s Kubernetes cluster without sudo, you need to configure the kubectl command-line tool to communicate with the K3s API server.
mkdir ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && sudo chown $USER ~/.kube/config
sudo chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
Now that you have set up the kubeconfig file, you can use kubectl to access the cluster from your local machine. To ensure everything is set up correctly, let's verify our cluster's status:
kubectl get nodes
The node should now be listed without requiring sudo. Next, check the health of your cluster pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
torch 1/1 Running 1 (18m ago) 46m
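If you prefer to script against the cluster, the official kubernetes Python package (assumed installed via pip install kubernetes) can use the same kubeconfig; a minimal sketch:
from kubernetes import client, config

# Load ~/.kube/config (the file copied from /etc/rancher/k3s/k3s.yaml above)
config.load_kube_config()
v1 = client.CoreV1Api()

# List cluster nodes and the pods in the default namespace
for node in v1.list_node().items:
    print("node:", node.metadata.name, node.status.node_info.kubelet_version)
for pod in v1.list_namespaced_pod("default").items:
    print("pod:", pod.metadata.name, pod.status.phase)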
Congratulations! You have successfully set up a single-node K3s Kubernetes cluster on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
Building Triton Inference Server from Source with S3 Support
The official Triton Inference Server Docker images for igpu do not include S3 support by default. To enable S3 support, we need to build the server from source.
For reference, here is the Docker command using the official image (without S3 support):
docker run --runtime nvidia --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3-igpu tritonserver --model-repository=/models
To build Triton Inference Server from source, we will use a bash script that automates the process. Create a new file called build_triton.sh with the following contents:
#!/usr/bin/env bash
TRITON_VERSION="${1}"
[[ -z "${TRITON_VERSION}" ]] && TRITON_VERSION="24.08"
IMAGE_NAME="tritonserver"
OFFICIAL_MIN_IMAGE_TAG="${TRITON_VERSION}-py3-igpu-min"
CUSTOM_IMAGE_TAG="${TRITON_VERSION}-igpu-s3"
# Create a directory for Triton and clone the repository
rm -rf triton
mkdir triton && cd triton
git clone --recurse-submodules https://github.com/triton-inference-server/server.git
cd server
# Checkout the desired Triton version
git checkout "r${TRITON_VERSION}"
# Build the Triton Inference Server
sudo python3 build.py \
--build-parallel 10 \
--no-force-clone \
--target-platform igpu \
--target-machine aarch64 \
--filesystem s3 \
--enable-gpu \
--enable-mali-gpu \
--enable-metrics \
--enable-logging \
--enable-stats \
--enable-cpu-metrics \
--enable-nvtx \
--backend onnxruntime \
--backend pytorch \
--backend tensorflow \
--backend python \
--backend tensorrt \
--endpoint http \
--endpoint grpc \
--min-compute-capability "5.3" \
--image "base,nvcr.io/nvidia/${IMAGE_NAME}:${OFFICIAL_MIN_IMAGE_TAG}" \
--image "gpu-base,nvcr.io/nvidia/${IMAGE_NAME}:${OFFICIAL_MIN_IMAGE_TAG}"
# Tag the image locally without pushing to a registry
docker tag "${IMAGE_NAME}:latest" "${IMAGE_NAME}:${CUSTOM_IMAGE_TAG}"
echo "Docker image '${IMAGE_NAME}:${CUSTOM_IMAGE_TAG}' created successfully."
To run the build script, save the file and make it executable by running the following command:
chmod +x build_triton.sh
Then, run the script by executing the following command:
./build_triton.sh
This will build the Triton Inference Server from source with S3 support. The build process may take several hours to complete.
After the build is complete, verify that the Docker image has been created successfully by running the docker image ls command:
REPOSITORY     TAG             IMAGE ID       CREATED      SIZE
tritonserver   24.08-igpu-s3   0d00f465cfca   4 days ago   9.81GB
I pushed the Docker image to the Docker Hub container registry, making it available for others to use; it is referenced later in the Kubernetes deployment as shahizat005/tritonserver:24.08-igpu-s3.
Next, we can test the image by downloading some example models for Triton. To do this, clone the Triton server GitHub repository:
git clone https://github.com/triton-inference-server/server.git
Then, navigate to the docs/examples directory and run the fetch_models.sh script to download the models:
cd server/docs/examples
./fetch_models.sh
Then check the contents of the model_repository folder using the ls -l command:
Now that we have the models, we can run the Triton server using Docker. Run the following command to start the server:
sudo docker run --runtime nvidia --rm --net=host -v ${PWD}/model_repository:/models tritonserver:24.08-igpu-s3 tritonserver --model-repository=/models --allow-gpu-metrics=false --strict-model-config=false --exit-on-error=false --strict-readiness=false
This command starts the Triton server with the downloaded models and maps the model_repository folder to the /models directory inside the container.
The output of the command should look like this:
E1025 13:11:27.645363 1 tritonserver.cc:2607] "Internal: failed to load all models"
I1025 13:11:27.649423 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I1025 13:11:27.650101 1 http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
I1025 13:11:27.691704 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Despite the warning about loading all models, the GRPC, HTTP, and metrics services have started successfully.
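To see exactly which models loaded (and which did not), you can query the repository index of the running server; a minimal sketch using the tritonclient Python package (assumed installed via pip install tritonclient[http]):
import tritonclient.http as httpclient

# The container runs with --net=host, so the server is reachable on localhost
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live:", client.is_server_live())

# List every model in the repository together with its load state
for model in client.get_model_repository_index():
    print(model.get("name"), model.get("version", "-"), model.get("state", "-"))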
Next, we need to create the model repository. The model repository is the directory in which you place the AI models you want Triton Inference Server to serve.
Setting Up MinIO with Docker Compose
MinIO is an open-source object storage server designed to store unstructured data (such as photos, videos, log files, backups, and container/VM images).
As described above, the model repository is the directory holding the AI models that Triton Inference Server will serve. In this example, we will use the /mnt/minio_data directory to store our MinIO data.
sudo mkdir -p /mnt/minio_data
Create a new file called docker-compose.yml and add the following contents:
version: '3.8'
services:
  s3service:
    image: quay.io/minio/minio:latest
    command: server --console-address ":9001" /data
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - /mnt/minio_data:/data
    env_file: minio.env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
  initialize-s3service:
    image: quay.io/minio/mc
    depends_on:
      s3service:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c '
      /usr/bin/mc alias set s3service http://s3service:9000 "$${MINIO_ROOT_USER}" "$${MINIO_ROOT_PASSWORD}";
      /usr/bin/mc mb s3service/"$${BUCKET_NAME}";
      /usr/bin/mc admin user add s3service "$${ACCESS_KEY}" "$${SECRET_KEY}";
      /usr/bin/mc admin policy attach s3service readwrite --user "$${ACCESS_KEY}";
      exit 0;
      '
    env_file: minio.env
This file defines two services: s3service and initialize-s3service. The s3service service runs the MinIO server, while initialize-s3service creates the MinIO bucket and user. The bucket data is stored on an external volume mounted from the /mnt/minio_data directory.
Next, we need to prepare the environment file minio.env that contains the MinIO credentials and bucket name. Create a new file called minio.env and add the following contents:
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=password
BUCKET_NAME=model-repo
# Development credentials for storing files locally
ACCESS_KEY=VPP0fkoCyBZx8YU0QTjH
SECRET_KEY=iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js
To deploy the services, run the following command:
docker compose up -d
This command will start the services in detached mode.
To verify that the services are running, you can check the output of the docker compose up command (or inspect it later with docker compose logs):
[+] Running 3/3
✔ Network minio_advanced_default Created 0.1s
✔ Container minio_advanced-s3service-1 Created 0.1s
✔ Container minio_advanced-initialize-s3service-1 Created 0.1s
Attaching to initialize-s3service-1, s3service-1
s3service-1 | MinIO Object Storage Server
s3service-1 | Copyright: 2015-2024 MinIO, Inc.
s3service-1 | License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
s3service-1 | Version: RELEASE.2024-10-13T13-34-11Z (go1.22.8 linux/arm64)
s3service-1 |
s3service-1 | API: http://172.18.0.2:9000 http://127.0.0.1:9000
s3service-1 | WebUI: http://172.18.0.2:9001 http://127.0.0.1:9001
s3service-1 |
s3service-1 | Docs: https://docs.min.io
initialize-s3service-1 | Added `s3service` successfully.
initialize-s3service-1 | Bucket created successfully `s3service/model-repo`.
initialize-s3service-1 | Added user `VPP0fkoCyBZx8YU0QTjH` successfully.
initialize-s3service-1 | Attached Policies: [readwrite]
initialize-s3service-1 | To User: VPP0fkoCyBZx8YU0QTjH
initialize-s3service-1 exited with code 0
This output shows that the services are running and that the MinIO bucket and user have been created.
To access the MinIO web console, open a web browser and navigate to http://<server-ip>:9001 (the console address configured above), replacing <server-ip> with the IP address of your server. Log in with the credentials specified in the minio.env file.
Screen after login. Verify that a bucket was created via the web interface:
Verify that a user was created via the web interface
To upload and download files from MinIO, you need to set up a client using one of the approaches described below.
Option 1: Setting up the MinIO Client
To install the MinIO client, run the following command:
curl https://dl.min.io/client/mc/release/linux-arm64/mc \
--create-dirs \
-o ~/minio-binaries/mc
This command downloads the MinIO client binary.
Make the downloaded file executable:
chmod +x $HOME/minio-binaries/mc
export PATH=$PATH:$HOME/minio-binaries/
To log in to the MinIO client, run the following command, replacing username and password with the credentials from minio.env:
mc alias set myminio http://localhost:9000 username password
This should output something like:
Added `myminio` successfully.
To test the connection, run the following command:
mc admin info myminio
This command will display information about the MinIO server.
● localhost:9000
Uptime: 1 hour
Version: 2024-08-17T01:24:54Z
Network: 1/1 OK
Drives: 1/1 OK
Pool: 1
┌──────┬────────────────────────┬─────────────────────┬──────────────┐
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
│ 1st │ 52.1% (total: 868 GiB) │ 1 │ 1 │
└──────┴────────────────────────┴─────────────────────┴──────────────┘
0 B Used, 1 Bucket, 0 Objects
1 drive online, 0 drives offline, EC:0
Run the following command from the directory containing model_repository (docs/examples) to copy the files to the bucket:
mc cp --recursive ./model_repository/ myminio/model-repo/
Check the MinIO web interface to verify the upload:
Option 2: Setting up the AWS CLI
The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. It also works with any S3-compatible cloud storage service, such as MinIO.
Download the AWS CLI installation package:
curl -O 'https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip'
Once downloaded, unzip the package and run the installation script:
unzip awscli-exe-linux-aarch64.zip
sudo ./aws/install
Verify the installation by running:
aws --version
Output:
aws-cli/2.17.42 Python/3.11.9 Linux/5.15.136-tegra exe/aarch64.ubuntu.22
Run the following command to configure AWS CLI:
aws configure
Output:
AWS Access Key ID [None]: G1OuuopbWD5thjk41EzT
AWS Secret Access Key [None]: xbn2Iseq7c1t5dwzPSBMcPsgf5spqaPGMgEIcQnm
Default region name [None]: us-east-1
Default output format [None]:
Set the default S3 signature version:
aws configure set default.s3.signature_version s3v4
List the buckets:
aws --endpoint-url http://192.168.0.9:9000 s3 ls
2024-08-31 00:25:49 model-repo
Upload a folder to the bucket:
aws s3 cp /home/jetson/Projects/triton/server/docs/examples/model_repository s3://model-repo/ --endpoint-url http://192.168.0.9:9000 --recursive
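If you would rather script the upload, the same S3 API can be driven from Python with boto3 (which also appears later in requirements.txt); a minimal sketch, assuming the MinIO endpoint and the credentials created above:
import os
import boto3

# MinIO speaks the S3 API; point boto3 at its endpoint explicitly
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.168.0.9:9000",
    aws_access_key_id="VPP0fkoCyBZx8YU0QTjH",
    aws_secret_access_key="iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js",
)

# Walk the local model_repository and upload every file, keeping relative paths as object keys
local_root = "model_repository"
for dirpath, _, filenames in os.walk(local_root):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        key = os.path.relpath(local_path, local_root)
        s3.upload_file(local_path, "model-repo", key)
        print("uploaded", key)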
Deploy Triton Inference Server
To give Triton access to your MinIO S3 object storage, you need to create a Kubernetes secret holding your access credentials. You can do this using the kubectl command:
kubectl create secret generic aws-credentials --from-literal=AWS_ACCESS_KEY_ID=VPP0fkoCyBZx8YU0QTjH --from-literal=AWS_SECRET_ACCESS_KEY=iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js
Replace the placeholder values with the access credentials from your minio.env file.
To deploy the Triton Inference Server, you need to create a Kubernetes deployment configuration. Here is an example triton_deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-deploy
  labels:
    app: triton-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-app
  template:
    metadata:
      labels:
        app: triton-app
    spec:
      containers:
        - name: triton-container
          ports:
            - containerPort: 8000
              name: http-triton
            - containerPort: 8001
              name: grpc-triton
            - containerPort: 8002
              name: metrics-triton
          image: "shahizat005/tritonserver:24.08-igpu-s3"
          command: ["/bin/bash"]
          args: ["-c", "cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates && update-ca-certificates && /opt/tritonserver/bin/tritonserver --model-store=s3://192.168.0.9:9000/model-repo --strict-model-config=false"]
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: AWS_SECRET_ACCESS_KEY
          livenessProbe:
            failureThreshold: 60
            initialDelaySeconds: 600
            periodSeconds: 5
            httpGet:
              path: /v2/health/live
              port: http-triton
          readinessProbe:
            failureThreshold: 60
            initialDelaySeconds: 600
            periodSeconds: 5
            httpGet:
              path: /v2/health/ready
              port: http-triton
The replica count is set to 3. Depending on your requirements, you can change the GPU allocation and replica counts. The livenessProbe and readinessProbe ensure that the containers are alive and ready to serve incoming traffic.
Apply the deployment manifest:
kubectl apply -f triton_deployment.yaml
Check the status of the pods to ensure they are running correctly:
kubectl get pods
The expected sample output is as follows:
NAME READY STATUS RESTARTS AGE
triton-deploy-6845554ffd-2fjnh 0/1 Running 0 8m23s
triton-deploy-6845554ffd-97wr2 0/1 Running 0 8m23s
triton-deploy-6845554ffd-stgsc 0/1 Running 0 8m23s
The server logs will display the status of the loaded models and the start-up messages for the gRPC, HTTP, and Metrics services.
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
+----------------------+---------+--------+
The container exposes three ports:
- 8000: HTTP port for health checks and model management.
- 8001: GRPC port for model inference requests.
- 8002: Metrics port for monitoring server performance.
I0825 10:34:59.260117 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0825 10:34:59.260426 1 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0825 10:34:59.301991 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Install MetalLB for Load Balancing
Let's install MetalLB and get it to assign IP addresses for the Triton services.
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml
Once that's run, we can confirm that MetalLB is running by checking the pods in the metallb-system namespace.
kubectl get pods -n metallb-system
This should return something similar to:
NAME READY STATUS RESTARTS AGE
controller-6dd967fdc7-jxs8l 1/1 Running 0 8m25s
speaker-2rdcj 1/1 Running 0 8m24s
Then define the address-pools and addresses values as desired and apply the manifest; note that in layer 2 mode MetalLB typically also needs an L2Advertisement resource referencing this pool before it announces the addresses.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.0.10-192.168.0.20
Create a Kubernetes service for the Triton Inference Server:
apiVersion: v1
kind: Service
metadata:
  name: triton-service
  labels:
    app: triton-app
spec:
  selector:
    app: triton-app
  type: LoadBalancer
  ports:
    - name: http-triton
      port: 8000
      targetPort: 8000
      nodePort: 30800
      protocol: TCP
    - name: grpc-triton
      port: 8001
      targetPort: 8001
      nodePort: 30801
      protocol: TCP
    - name: metrics-triton
      port: 8002
      targetPort: 8002
      nodePort: 30802
      protocol: TCP
Apply this configuration:
kubectl apply -f triton_service.yaml
A Service of type LoadBalancer is created for the deployment. The Triton Inference Server can then be accessed using the LoadBalancer IP, which is on the application network.
Then check using the kubectl get svc command:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP
triton-service LoadBalancer 10.43.138.92 192.168.0.11 8000:30918/TCP,8001:31112/TCP,8002:30934/TCP 5h56m
If the load balancer is set up correctly, the Triton deployment should receive an external IP: 192.168.0.11.
First, you need to confirm that the Triton server has started normally and can be accessed remotely. Here are the steps to check the server's status and metrics.
To verify the health status of the Triton Inference Server, execute the following curl command:
curl -v 192.168.0.11:8000/v2/health/ready
The expected output should include the "HTTP/1.1 200 OK" status, indicating that the server is running correctly.
* Trying 192.168.0.11:8000...
* Connected to 192.168.0.11 (192.168.0.11) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 192.168.0.11:8000
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host 192.168.0.11 left intact
As long as the response contains "HTTP/1.1 200 OK", everything is working correctly.
To monitor the server's performance metrics, use the following command:
curl -v 192.168.0.11:8002/metrics
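The same checks can be scripted; a minimal sketch using the tritonclient Python package and plain HTTP against the external IP assigned above:
import requests
import tritonclient.http as httpclient

# Connect to Triton through the MetalLB-assigned external IP
client = httpclient.InferenceServerClient(url="192.168.0.11:8000")
print("live: ", client.is_server_live())
print("ready:", client.is_server_ready())
print("densenet_onnx ready:", client.is_model_ready("densenet_onnx"))

# The Prometheus metrics endpoint serves plain text on port 8002
metrics = requests.get("http://192.168.0.11:8002/metrics", timeout=5).text
print("\n".join(metrics.splitlines()[:5]))  # show the first few metric lines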
Efficient Utilization of GPU Memory
Triton Inference Server provides APIs to manage model loading and unloading, which helps in efficient utilization of GPU resources.
You can call the POST API to load a model:
curl -X POST http://192.168.0.11:8000/v2/repository/models/<Model Name>/load
You can call the POST API to unload a model:
curl -X POST http://192.168.0.11:8000/v2/repository/models/<Model Name>/unload
This allows multiple models to share the GPU, loading only what is needed, which can help optimize memory usage and improve performance.
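The same repository API is exposed through the Python client; a minimal sketch (note that explicit load/unload generally requires the server to run with --model-control-mode=explicit, an option not shown in the deployment above):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

# Load a model into memory, check its state, then release it again
client.load_model("densenet_onnx")
print("loaded:", client.is_model_ready("densenet_onnx"))

client.unload_model("densenet_onnx")
print("still loaded:", client.is_model_ready("densenet_onnx"))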
Run Triton Inference Client
To communicate with the Triton server, you can use the provided C++ and Python client libraries.
Clone the Triton Client Repository:
git clone https://github.com/triton-inference-server/client.git
Create a requirements.txt file with the following dependencies:
pillow
numpy
attrdict
tritonclient
google-api-python-client
grpcio
geventhttpclient
boto3
Then, install these dependencies using:
pip3 install -r requirements.txt
Navigate to the client/src/python/examples directory and execute the client script. Here is an example using the image_client.py script:
python3 image_client.py \
-u 192.168.0.11:8000 \
-m inception_graphdef \
-s INCEPTION \
-x 1 \
-c 1 \
test.png
This command will send an inference request to the Triton server and display the results. The output should look something like this:
Request 1, batch size 1
0.430289 (505) = COFFEE MUG
PASS
Adding Custom YOLO Models to the Triton Inference Engine
It's highly recommended to use the same Docker container for both converting your custom model to the TensorRT engine format and deploying it on Triton. This helps avoid potential incompatibilities that might arise when using different environments.
Run the following command to start the Docker container:
sudo docker run --runtime nvidia -it --ipc=host tritonserver:24.08-igpu-s3
Setting up YOLOv8 for your computer vision projects is a straightforward process. It begins with creating a virtual environment to manage Python dependencies efficiently.
Install the Python virtual environment packages:
pip3 install virtualenv
apt install python3.10-venv
Create a virtual environment within the current directory:
python3 -m venv yolo
Activate the new virtual environment:
source yolo/bin/activate
Install the YOLO library (e.g., ultralytics) and other dependencies within the virtual environment:
pip install ultralytics
Download the pretrained YOLOv8 model:
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8l.pt
Use YOLO's export function to convert the model into the Open Neural Network Exchange (ONNX) format, which is interoperable with various frameworks:
yolo export model=./yolov8l.pt imgsz=640 format=onnx opset=11
This command exports the yolov8l.pt model to the ONNX format with an image size of 640 pixels and opset level 11.
Once the export is complete, we can convert the model to TensorRT format. Use the trtexec tool from the TensorRT installation to convert the ONNX model to a TensorRT engine:
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8l.onnx \
--saveEngine=yolov8l.engine \
--streams=8
This command converts the yolov8l.onnx model to a TensorRT engine named yolov8l.engine using 8 streams (threads) for improved performance.
For faster inference, you can convert the model to use the FP16 (half-precision) data type. Use the following flags with trtexec:
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8l.onnx \
--saveEngine=yolov8l.engine \
--fp16 \
--inputIOFormats=fp16:chw \
--outputIOFormats=fp16:chw \
--streams=8
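As an alternative to calling trtexec directly, Ultralytics can export straight to a TensorRT engine from Python; a rough sketch of that alternative path (it requires a GPU in the container and is not the exact command used above):
from ultralytics import YOLO

# Load the pretrained checkpoint and export it directly to a TensorRT engine.
# half=True produces an FP16 engine, comparable to the --fp16 trtexec flags above.
model = YOLO("yolov8l.pt")
engine_path = model.export(format="engine", imgsz=640, half=True, device=0)
print("engine written to:", engine_path)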
To be able to use TYPE_FP16 as the data type for the input and output, use --inputIOFormats and --outputIOFormats when parsing the model from ONNX.
To inspect a generated engine with the trtexec command, you can use the --loadEngine flag:
trtexec --loadEngine=model.plan --verbose
Create a directory structure to organize your models and configuration files:
./upload/
├── yolov8l_fp16
│ ├── 1
│ │ └── yolov8l.engine
│ └── config.pbtxt
└── yolov8l_fp32
├── 1
│ └── yolov8l.engine
└── config.pbtxt
4 directories, 4 files
Under each model folder, we have a config.pbtxt (protobuf text) file which describes the model configuration:
name: "yolov8l_fp32"
platform: "tensorrt_plan"
max_batch_size : 0
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ -1, 3, 640, 640 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ -1, 84, 8400 ]
}
]
Upload the Model to MinIO S3 Model Repository
mc cp --recursive ./upload/ myminio/model-repo/
Then redeploy Triton:
kubectl apply -f triton_deployment.yaml
sudo kubectl get pods
You will see the following output:
NAME READY STATUS RESTARTS AGE
triton-deploy-6845554ffd-5rbbj 0/1 Running 0 6s
triton-deploy-6845554ffd-6zjwg 0/1 Running 0 6s
triton-deploy-6845554ffd-kthrg 0/1 Running 0 6s
Check the logs using kubectl logs to verify that the model was loaded successfully:
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
| yolov8l_fp16         | 1       | READY  |
| yolov8l_fp32         | 1       | READY  |
+----------------------+---------+--------+
"Started GRPCInferenceService at 0.0.0.0:8001"
"Started HTTPService at 0.0.0.0:8000"
"Started Metrics Service at 0.0.0.0:8002"
All models were loaded successfully. Once deployed, you can run inference against the LoadBalancer IP address using the following Python code:
from ultralytics.utils import ROOT, yaml_load
from ultralytics.utils.checks import check_yaml
import time
import cv2
import numpy as np
from PIL import Image
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

CLASSES = yaml_load(check_yaml('coco128.yaml'))['names']
colors = np.random.uniform(0, 255, size=(len(CLASSES), 3))

def draw_bounding_box(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    label = f'{CLASSES[class_id]} ({confidence:.2f})'
    color = colors[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x - 10, y - 10), cv2.FONT_HERSHEY_DUPLEX, 0.7, color, 1)

def main():
    # Pre Processing
    original_image = cv2.imread("./test.jpg")
    or_copy = original_image.copy()
    [height, width, _] = original_image.shape
    length = max((height, width))
    scale = length / 640
    image = np.zeros((length, length, 3), np.uint8)
    image[0:height, 0:width] = original_image
    resize = cv2.resize(image, (640, 640))
    img = resize[np.newaxis, :, :, :] / 255.0
    img = img.transpose((0, 3, 1, 2)).astype(np.float32)

    inputs = httpclient.InferInput("images", img.shape, datatype="FP32")
    inputs.set_data_from_numpy(img, binary_data=True)
    outputs = httpclient.InferRequestedOutput("output0", binary_data=True)

    # Inference
    start_time = time.time()
    res = client.infer(model_name="yolov8l_fp32", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
    end_time = time.time()
    inf_time = (end_time - start_time)
    print(f"inference time: {inf_time*1000:.3f} ms")

    # Post Processing
    outputs = np.array([cv2.transpose(res[0].astype(np.float32))])
    rows = outputs.shape[1]
    boxes = []
    scores = []
    class_ids = []
    for i in range(rows):
        classes_scores = outputs[0][i][4:]
        (minScore, maxScore, minClassLoc, (x, maxClassIndex)) = cv2.minMaxLoc(classes_scores)
        if maxScore >= 0.25:
            box = [
                outputs[0][i][0] - (0.5 * outputs[0][i][2]), outputs[0][i][1] - (0.5 * outputs[0][i][3]),
                outputs[0][i][2], outputs[0][i][3]]
            boxes.append(box)
            scores.append(maxScore)
            class_ids.append(maxClassIndex)

    result_boxes = cv2.dnn.NMSBoxes(boxes, scores, 0.25, 0.45, 0.5)
    detections = []
    for i in range(len(result_boxes)):
        index = result_boxes[i]
        box = boxes[index]
        detection = {
            'class_id': class_ids[index],
            'class_name': CLASSES[class_ids[index]],
            'confidence': scores[index],
            'box': box,
            'scale': scale}
        detections.append(detection)
        draw_bounding_box(or_copy, class_ids[index], scores[index], round(box[0] * scale), round(box[1] * scale),
                          round((box[0] + box[2]) * scale), round((box[1] + box[3]) * scale))

    # Save the output image to a file
    cv2.imwrite("detection_result.jpg", or_copy)
    print("Result saved as detection_result.jpg")
    return or_copy

if __name__ == "__main__":
    result_image = main()
Inference time for FP32 based precision model:
inference time: 210.176 ms
To test FP16, modify the code to use FP16 data types and run inference again:
img = img.transpose((0, 3, 1, 2)).astype(np.float16)
inputs = httpclient.InferInput("images", img.shape, datatype="FP16")
res = client.infer(model_name="yolov8l_fp16", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
Inference time for FP16 model:
inference time: 131.801 ms
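To confirm the declared input and output datatypes of each deployed variant before switching precision, you can query the model metadata; a minimal sketch:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

# Print the declared I/O names, datatypes, and shapes for both precision variants
for model_name in ("yolov8l_fp32", "yolov8l_fp16"):
    meta = client.get_model_metadata(model_name)
    for tensor in meta["inputs"] + meta["outputs"]:
        print(model_name, tensor["name"], tensor["datatype"], tensor["shape"])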
Here are two example use cases that demonstrate the model's object detection capabilities. The first example uses an image of a black and white ceramic tea cup with a saucer on a black wooden table.
The model successfully detects the tea cup and the dining table in the image, as evident from the detection result with the bounding box.
The second example uses an image of a light workplace with a laptop.
These examples were used to demonstrate the YOLO model's ability to detect objects.
For video inference, you can use the following code snippet:
from ultralytics.utils import ROOT, yaml_load
from ultralytics.utils.checks import check_yaml
import time
import cv2
import numpy as np
from PIL import Image
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.10:8000")

CLASSES = yaml_load(check_yaml('coco128.yaml'))['names']
colors = np.random.uniform(0, 255, size=(len(CLASSES), 3))

def draw_bounding_box(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    label = f'{CLASSES[class_id]} ({confidence:.2f})'
    color = colors[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

def process_frame(frame):
    # Pre-process each frame
    original_image = frame
    or_copy = original_image.copy()
    height, width, _ = original_image.shape
    length = max(height, width)
    scale = length / 640
    image = np.zeros((length, length, 3), np.uint8)
    image[0:height, 0:width] = original_image
    resize = cv2.resize(image, (640, 640))
    img = resize[np.newaxis, :, :, :] / 255.0
    img = img.transpose((0, 3, 1, 2)).astype(np.float32)

    # Inference
    inputs = httpclient.InferInput("images", img.shape, datatype="FP32")
    inputs.set_data_from_numpy(img, binary_data=True)
    outputs = httpclient.InferRequestedOutput("output0", binary_data=True)
    start_time = time.time()
    res = client.infer(model_name="yolov8l_fp32", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
    end_time = time.time()
    inf_time = (end_time - start_time)
    print(f"inference time: {inf_time*1000:.3f} ms")

    # Post-process
    outputs = np.array([cv2.transpose(res[0].astype(np.float32))])
    rows = outputs.shape[1]
    boxes = []
    scores = []
    class_ids = []
    for i in range(rows):
        classes_scores = outputs[0][i][4:]
        (minScore, maxScore, minClassLoc, (x, maxClassIndex)) = cv2.minMaxLoc(classes_scores)
        if maxScore >= 0.25:
            box = [
                outputs[0][i][0] - (0.5 * outputs[0][i][2]), outputs[0][i][1] - (0.5 * outputs[0][i][3]),
                outputs[0][i][2], outputs[0][i][3]]
            boxes.append(box)
            scores.append(maxScore)
            class_ids.append(maxClassIndex)

    result_boxes = cv2.dnn.NMSBoxes(boxes, scores, 0.25, 0.45, 0.5)
    for i in range(len(result_boxes)):
        index = result_boxes[i]
        box = boxes[index]
        draw_bounding_box(
            or_copy,
            class_ids[index],
            scores[index],
            round(box[0] * scale),
            round(box[1] * scale),
            round((box[0] + box[2]) * scale),
            round((box[1] + box[3]) * scale)
        )
    return or_copy

def main():
    input_video = "video1.mp4"  # Input video file path
    output_video = "detection_result.mp4"  # Output video file path
    cap = cv2.VideoCapture(input_video)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Define the codec and create VideoWriter object
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video, fourcc, fps, (width, height))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Process each frame and write to output video
        processed_frame = process_frame(frame)
        out.write(processed_frame)

    # Release everything
    cap.release()
    out.release()
    print(f"Detection result saved as {output_video}")

if __name__ == "__main__":
    main()
Here's an example video demonstrating object detection with multiple objects using the Ultralytics YOLOv8 model.
This implementation serves as a starting point for developers looking to integrate different models into their video analytics applications on the edge. To conclude, I explored how to set up and run NVIDIA Triton Inference Server using K3s and MinIO on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you liked this post, please support me by subscribing to my blog.
Thanks and acknowledgements
- Video footage courtesy of Pexels.com (Mixkit Free Video Assets)
- Photos courtesy of Pexels.com (Nao Triponez and Cup of Couple)
- Thanks to Andreas Schliebitz for his help with the Triton Docker image.