Triton Inference Server is an open-source platform designed to streamline the deployment and execution of machine learning models. This powerful tool enables you to deploy models from various deep learning frameworks, including TensorRT, TensorFlow, PyTorch, and ONNX, on a wide range of hardware.
Kubernetes has become the dominant container orchestration platform, but running a full Kubernetes cluster can be resource-intensive. Lightweight options like K3s, Minikube, and MicroK8s allow users to easily create single-node clusters locally for development and testing. In this project, I'll leverage the lightweight Kubernetes distribution K3s to deploy Triton Inference Server on a single-node cluster hosted on the NVIDIA Jetson AGX Orin 64GB Developer Kit. K3s is developed by Rancher Labs and optimized for IoT and edge computing scenarios, making it an ideal choice for our deployment. To efficiently store and manage our AI models, I'll use MinIO, a high-performance object storage solution. MinIO will serve as our model repository, allowing us to easily access and deploy different models as needed.
The following diagram shows the target infrastructure that you will obtain after following this blog post.
In this project, I'll focus on deploying Ultralytics YOLOv8 in TensorRT format, a cutting-edge object detection model. YOLOv8 represents the latest advancement in the YOLO series, renowned for its real-time performance and accuracy.
Set up K3s on the NVIDIA Jetson AGX Orin 64GB Developer Kit
Run the following command to download and install K3s on the NVIDIA Jetson AGX Orin 64GB Developer Kit:
curl -sfL https://get.k3s.io | sh
This command downloads and installs the latest stable version of K3s. The installation process will automatically start and run a single-node K3s cluster.
[INFO] Finding release for channel stable
[INFO] Using v1.30.3+k3s1 as release
[INFO] Downloading hash https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/sha256sum-arm64.txt
[INFO] Downloading binary https://github.com/k3s-io/k3s/releases/download/v1.30.3+k3s1/k3s-arm64
[INFO] Verifying binary download
[INFO] Installing k3s to /usr/local/bin/k3s
[INFO] Skipping installation of SELinux RPM
[INFO] Creating /usr/local/bin/kubectl symlink to k3s
[INFO] Creating /usr/local/bin/crictl symlink to k3s
[INFO] Skipping /usr/local/bin/ctr symlink to k3s, command exists in PATH at /usr/bin/ctr
[INFO] Creating killall script /usr/local/bin/k3s-killall.sh
[INFO] Creating uninstall script /usr/local/bin/k3s-uninstall.sh
[INFO] env: Creating environment file /etc/systemd/system/k3s.service.env
[INFO] systemd: Creating service file /etc/systemd/system/k3s.service
[INFO] systemd: Enabling k3s unit
Created symlink /etc/systemd/system/multi-user.target.wants/k3s.service → /etc/systemd/system/k3s.service.
[INFO] systemd: Starting k3s
Check the status of the K3s service to ensure it's running:
systemctl status k3s
This command will display the status of the K3s service. You should see output similar to the following:
Loaded: loaded (/etc/systemd/system/k3s.service; enabled; vendor preset: enabled)
Active: active (running) since Sat 2024-08-17 11:55:41 +05; 1min 29s ago
Docs: https://k3s.io
Process: 885450 ExecStartPre=/bin/sh -xc ! /usr/bin/systemctl is-enabled --quiet nm-cloud-setup.service 2>/dev/null (code=exited, status=0/SUCCESS)
Process: 885469 ExecStartPre=/sbin/modprobe br_netfilter (code=exited, status=0/SUCCESS)
Process: 885478 ExecStartPre=/sbin/modprobe overlay (code=exited, status=0/SUCCESS)
Main PID: 885488 (k3s-server)
Tasks: 175
Memory: 1.4G
CPU: 43.904s
CGroup: /system.slice/k3s.service
If you're using a Jetson device with an NVIDIA GPU, K3s should automatically detect the NVIDIA container runtime. You can verify this by checking the /var/lib/rancher/k3s/agent/etc/containerd/config.toml file:
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia"]
runtime_type = "io.containerd.runc.v2"
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes."nvidia".options]
BinaryName = "/usr/bin/nvidia-container-runtime"
SystemdCgroup = true
To ensure the NVIDIA runtime is used by default and avoid potential issues, follow these steps:
cp /var/lib/rancher/k3s/agent/etc/containerd/config.toml /var/lib/rancher/k3s/agent/etc/containerd/config.toml.tmpl
Edit the config.toml.tmpl file and add the following lines under the [plugins."io.containerd.grpc.v1.cri".containerd] section:
[plugins."io.containerd.grpc.v1.cri".containerd]
snapshotter = "overlayfs"
disable_snapshot_annotations = true
default_runtime_name = "nvidia"
Restart the K3s service so that its embedded containerd regenerates config.toml from the template and picks up the change:
sudo systemctl restart k3s
The K3s installation script installs the kubectl binary automatically for you. Use kubectl get nodes to view the available nodes in your cluster:
sudo kubectl get nodes
This command will display the nodes in the cluster. You should see output similar to the following:
NAME STATUS ROLES AGE VERSION
ubuntu Ready control-plane,master 5m51s v1.30.3+k3s1
You can also use the following command to display more detailed information about the cluster:
sudo kubectl cluster-info
You should see output similar to the following:
Kubernetes control plane is running at https://127.0.0.1:6443
CoreDNS is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/kube-dns:dns/proxy
Metrics-server is running at https://127.0.0.1:6443/api/v1/namespaces/kube-system/services/https:metrics-server:https/proxy
To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
To deploy a test application, create a YAML file called test.yaml with the following contents:
apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  imagePullSecrets:
    - name: my-image-pull-secret
  containers:
    - name: torchtest
      image: dustynv/l4t-pytorch:r36.2.0
      securityContext:
        privileged: true
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
Deploy the application using the following command:
sudo kubectl apply -f test.yaml
Verify the pod is running with sudo kubectl get pods:
NAME READY STATUS RESTARTS AGE
torch 1/1 Running 0 111s
Once the pod is in the Running state, we can open a Python interpreter inside it:
sudo kubectl exec -it torch -- python3
You can now interact with the Python terminal and verify that the NVIDIA GPU is available:
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.cuda.get_device_name()
'Orin'
>>> torch.cuda.is_available()
True
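As an additional check, you can run a small tensor operation on the GPU from the same container; a minimal sketch (the matrix sizes are arbitrary):
import torch

# Allocate two tensors directly on the GPU and multiply them
a = torch.rand(1024, 1024, device="cuda")
b = torch.rand(1024, 1024, device="cuda")
c = a @ b
print(c.device, c.shape)  # expected: cuda:0 torch.Size([1024, 1024])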
To interact with the K3s Kubernetes cluster without sudo, you need to configure the kubectl command-line tool to communicate with the K3s API server.
mkdir ~/.kube
sudo cp /etc/rancher/k3s/k3s.yaml ~/.kube/config && sudo chown $USER ~/.kube/config
sudo chmod 600 ~/.kube/config && export KUBECONFIG=~/.kube/config
Now that you have set up the kubeconfig file, you can use kubectl to access the cluster from your local machine. To ensure everything is set up correctly, let's verify our cluster's status:
kubectl get nodes
The node should now be listed without requiring sudo. Next, check the health of your cluster pods:
kubectl get pods
NAME READY STATUS RESTARTS AGE
torch 1/1 Running 1 (18m ago) 46m
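If you prefer to script against the cluster, the official kubernetes Python package (assumed installed via pip install kubernetes) can use the same kubeconfig; a minimal sketch:
from kubernetes import client, config

# Load ~/.kube/config (the file copied from /etc/rancher/k3s/k3s.yaml above)
config.load_kube_config()
v1 = client.CoreV1Api()

# List cluster nodes and the pods in the default namespace
for node in v1.list_node().items:
    print("node:", node.metadata.name, node.status.node_info.kubelet_version)
for pod in v1.list_namespaced_pod("default").items:
    print("pod:", pod.metadata.name, pod.status.phase)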
Congratulations! You have successfully set up a single-node K3s Kubernetes cluster on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
Building Triton Inference Server from Source with S3 Support
The official Triton Inference Server Docker images for igpu do not include S3 support by default. To enable S3 support, we need to build the server from source.
For reference, here is the Docker command using the official image (without S3 support):
docker run --runtime nvidia --rm --net=host -v ${PWD}/model_repository:/models nvcr.io/nvidia/tritonserver:24.07-py3-igpu tritonserver --model-repository=/models
To build Triton Inference Server from source, we will use a bash script that automates the process. Create a new file called build_triton.sh with the following contents:
#!/usr/bin/env bash
TRITON_VERSION="${1}"
[[ -z "${TRITON_VERSION}" ]] && TRITON_VERSION="24.08"
IMAGE_NAME="tritonserver"
OFFICIAL_MIN_IMAGE_TAG="${TRITON_VERSION}-py3-igpu-min"
CUSTOM_IMAGE_TAG="${TRITON_VERSION}-igpu-s3"
# Create a directory for Triton and clone the repository
rm -rf triton
mkdir triton && cd triton
git clone --recurse-submodules https://github.com/triton-inference-server/server.git
cd server
# Checkout the desired Triton version
git checkout "r${TRITON_VERSION}"
# Build the Triton Inference Server
sudo python3 build.py \
--build-parallel 10 \
--no-force-clone \
--target-platform igpu \
--target-machine aarch64 \
--filesystem s3 \
--enable-gpu \
--enable-mali-gpu \
--enable-metrics \
--enable-logging \
--enable-stats \
--enable-cpu-metrics \
--enable-nvtx \
--backend onnxruntime \
--backend pytorch \
--backend tensorflow \
--backend python \
--backend tensorrt \
--endpoint http \
--endpoint grpc \
--min-compute-capability "5.3" \
--image "base,nvcr.io/nvidia/${IMAGE_NAME}:${OFFICIAL_MIN_IMAGE_TAG}" \
--image "gpu-base,nvcr.io/nvidia/${IMAGE_NAME}:${OFFICIAL_MIN_IMAGE_TAG}"
# Tag the image locally without pushing to a registry
docker tag "${IMAGE_NAME}:latest" "${IMAGE_NAME}:${CUSTOM_IMAGE_TAG}"
echo "Docker image '${IMAGE_NAME}:${CUSTOM_IMAGE_TAG}' created successfully."
To run the build script, save the file and make it executable by running the following command:
chmod +x build_triton.sh
Then, run the script by executing the following command:
./build_triton.sh
This will build the Triton Inference Server from source with S3 support. The build process may take several hours to complete.
After the build is complete, verify that the Docker image has been created successfully by running the docker image ls command:
REPOSITORY     TAG             IMAGE ID       CREATED      SIZE
tritonserver   24.08-igpu-s3   0d00f465cfca   4 days ago   9.81GB
I pushed the Docker image to the Docker Hub container registry, making it available for others to use; it is referenced later in the Kubernetes deployment as shahizat005/tritonserver:24.08-igpu-s3.
Next, we can test the image by downloading some example models for Triton. To do this, clone the Triton server GitHub repository:
git clone https://github.com/triton-inference-server/server.git
Then, navigate to the docs/examples directory and run the fetch_models.sh script to download the models:
cd server/docs/examples
./fetch_models.sh
Then check the contents of the model_repository folder using the ls -l command:
Now that we have the models, we can run the Triton server using Docker. Run the following command to start the server:
sudo docker run --runtime nvidia --rm --net=host -v ${PWD}/model_repository:/models tritonserver:24.08-igpu-s3 tritonserver --model-repository=/models --allow-gpu-metrics=false --strict-model-config=false --exit-on-error=false --strict-readiness=false
This command starts the Triton server with the downloaded models and maps the model_repository folder to the /models directory inside the container.
The output of the command should look like this:
E1025 13:11:27.645363 1 tritonserver.cc:2607] "Internal: failed to load all models"
I1025 13:11:27.649423 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I1025 13:11:27.650101 1 http_server.cc:4694] "Started HTTPService at 0.0.0.0:8000"
I1025 13:11:27.691704 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Despite the warning about loading all models, the GRPC, HTTP, and metrics services have started successfully.
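To see exactly which models loaded (and which did not), you can query the repository index of the running server; a minimal sketch using the tritonclient Python package (assumed installed via pip install tritonclient[http]):
import tritonclient.http as httpclient

# The container runs with --net=host, so the server is reachable on localhost
client = httpclient.InferenceServerClient(url="localhost:8000")
print("server live:", client.is_server_live())

# List every model in the repository together with its load state
for model in client.get_model_repository_index():
    print(model.get("name"), model.get("version", "-"), model.get("state", "-"))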
Next, we need to create the model repository. The model repository is the directory in which you place the AI models you want Triton Inference Server to serve.
Setting Up MinIO with Docker Compose
MinIO is an open-source object storage server designed to store unstructured data (such as photos, videos, log files, backups, and container/VM images).
As described above, the model repository is the directory holding the AI models that Triton Inference Server will serve. In this example, we will use the /mnt/minio_data directory to store our MinIO data.
sudo mkdir -p /mnt/minio_data
Create a new file called docker-compose.yml and add the following contents:
version: '3.8'
services:
  s3service:
    image: quay.io/minio/minio:latest
    command: server --console-address ":9001" /data
    ports:
      - '9000:9000'
      - '9001:9001'
    volumes:
      - /mnt/minio_data:/data
    env_file: minio.env
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9000/minio/health/live"]
      interval: 10s
      timeout: 5s
      retries: 5
  initialize-s3service:
    image: quay.io/minio/mc
    depends_on:
      s3service:
        condition: service_healthy
    entrypoint: >
      /bin/sh -c '
      /usr/bin/mc alias set s3service http://s3service:9000 "$${MINIO_ROOT_USER}" "$${MINIO_ROOT_PASSWORD}";
      /usr/bin/mc mb s3service/"$${BUCKET_NAME}";
      /usr/bin/mc admin user add s3service "$${ACCESS_KEY}" "$${SECRET_KEY}";
      /usr/bin/mc admin policy attach s3service readwrite --user "$${ACCESS_KEY}";
      exit 0;
      '
    env_file: minio.env
This file defines two services: s3service and initialize-s3service. The s3service service runs the MinIO server, while initialize-s3service creates the MinIO bucket and user. The bucket data is stored on an external volume mounted from the /mnt/minio_data directory.
Next, we need to prepare the environment file minio.env that contains the MinIO credentials and bucket name. Create a new file called minio.env and add the following contents:
MINIO_ROOT_USER=admin
MINIO_ROOT_PASSWORD=password
BUCKET_NAME=model-repo
# Development credentials for storing files locally
ACCESS_KEY=VPP0fkoCyBZx8YU0QTjH
SECRET_KEY=iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js
To deploy the services, run the following command:
docker compose up -d
This command will start the services in detached mode.
To verify that the services are running, you can check the output of the docker compose up command (or inspect it later with docker compose logs):
[+] Running 3/3
✔ Network minio_advanced_default Created 0.1s
✔ Container minio_advanced-s3service-1 Created 0.1s
✔ Container minio_advanced-initialize-s3service-1 Created 0.1s
Attaching to initialize-s3service-1, s3service-1
s3service-1 | MinIO Object Storage Server
s3service-1 | Copyright: 2015-2024 MinIO, Inc.
s3service-1 | License: GNU AGPLv3 - https://www.gnu.org/licenses/agpl-3.0.html
s3service-1 | Version: RELEASE.2024-10-13T13-34-11Z (go1.22.8 linux/arm64)
s3service-1 |
s3service-1 | API: http://172.18.0.2:9000 http://127.0.0.1:9000
s3service-1 | WebUI: http://172.18.0.2:9001 http://127.0.0.1:9001
s3service-1 |
s3service-1 | Docs: https://docs.min.io
initialize-s3service-1 | Added `s3service` successfully.
initialize-s3service-1 | Bucket created successfully `s3service/model-repo`.
initialize-s3service-1 | Added user `VPP0fkoCyBZx8YU0QTjH` successfully.
initialize-s3service-1 | Attached Policies: [readwrite]
initialize-s3service-1 | To User: VPP0fkoCyBZx8YU0QTjH
initialize-s3service-1 exited with code 0
This output shows that the services are running and that the MinIO bucket and user have been created.
To access the MinIO web console, open a web browser and navigate to http://<server-ip>:9001 (the console address configured above), replacing <server-ip> with the IP address of your server. Log in with the credentials specified in the minio.env file.
Screen after login. Verify that a bucket was created via the web interface:
Verify that a user was created via the web interface
To upload and download files from MinIO, you need to set up a client using one of the approaches described below.
Option 1: Setting up the MinIO Client
To install the MinIO client, run the following command:
curl https://dl.min.io/client/mc/release/linux-arm64/mc \
--create-dirs \
-o ~/minio-binaries/mc
This command downloads the MinIO client binary.
Make the downloaded file executable:
chmod +x $HOME/minio-binaries/mc
export PATH=$PATH:$HOME/minio-binaries/
To log in to the MinIO client, run the following command, replacing username and password with the credentials from minio.env:
mc alias set myminio http://localhost:9000 username password
This should output something like:
Added `myminio` successfully.
To test the connection, run the following command:
mc admin info myminio
This command will display information about the MinIO server.
● localhost:9000
Uptime: 1 hour
Version: 2024-08-17T01:24:54Z
Network: 1/1 OK
Drives: 1/1 OK
Pool: 1
┌──────┬────────────────────────┬─────────────────────┬──────────────┐
│ Pool │ Drives Usage │ Erasure stripe size │ Erasure sets │
│ 1st │ 52.1% (total: 868 GiB) │ 1 │ 1 │
└──────┴────────────────────────┴─────────────────────┴──────────────┘
0 B Used, 1 Bucket, 0 Objects
1 drive online, 0 drives offline, EC:0
Run the following command from the directory containing model_repository (docs/examples) to copy the files to the bucket:
mc cp --recursive ./model_repository/ myminio/model-repo/
Check the MinIO web interface to verify the upload:
Option 2: Setting up the AWS CLI
The AWS Command Line Interface (CLI) is a unified tool to manage your AWS services. It also works with any S3-compatible cloud storage service, such as MinIO.
Download the AWS CLI installation package:
curl -O 'https://awscli.amazonaws.com/awscli-exe-linux-aarch64.zip'
Once downloaded, unzip the package and run the installation script:
unzip awscli-exe-linux-aarch64.zip
sudo ./aws/install
Verify the installation by running:
aws --version
Output:
aws-cli/2.17.42 Python/3.11.9 Linux/5.15.136-tegra exe/aarch64.ubuntu.22
Run the following command to configure AWS CLI:
aws configure
Output:
AWS Access Key ID [None]: G1OuuopbWD5thjk41EzT
AWS Secret Access Key [None]: xbn2Iseq7c1t5dwzPSBMcPsgf5spqaPGMgEIcQnm
Default region name [None]: us-east-1
Default output format [None]:
Set the default S3 signature version:
aws configure set default.s3.signature_version s3v4
List the buckets:
aws --endpoint-url http://192.168.0.9:9000 s3 ls
2024-08-31 00:25:49 model-repo
Upload a folder to the bucket:
aws s3 cp /home/jetson/Projects/triton/server/docs/examples/model_repository s3://model-repo/ --endpoint-url http://192.168.0.9:9000 --recursive
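If you would rather script the upload, the same S3 API can be driven from Python with boto3 (which also appears later in requirements.txt); a minimal sketch, assuming the MinIO endpoint and the credentials created above:
import os
import boto3

# MinIO speaks the S3 API; point boto3 at its endpoint explicitly
s3 = boto3.client(
    "s3",
    endpoint_url="http://192.168.0.9:9000",
    aws_access_key_id="VPP0fkoCyBZx8YU0QTjH",
    aws_secret_access_key="iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js",
)

# Walk the local model_repository and upload every file, keeping relative paths as object keys
local_root = "model_repository"
for dirpath, _, filenames in os.walk(local_root):
    for filename in filenames:
        local_path = os.path.join(dirpath, filename)
        key = os.path.relpath(local_path, local_root)
        s3.upload_file(local_path, "model-repo", key)
        print("uploaded", key)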
Deploy Triton Inference Server
To give Triton access to your MinIO S3 object storage, you need to create a Kubernetes secret holding your access credentials. You can do this using the kubectl command:
kubectl create secret generic aws-credentials --from-literal=AWS_ACCESS_KEY_ID=VPP0fkoCyBZx8YU0QTjH --from-literal=AWS_SECRET_ACCESS_KEY=iFq6k8RLJw5B0faz0cKCXeQk0w9Q8UdtaFzHuw4Js
Replace the placeholder values with the access credentials from your minio.env file.
To deploy the Triton Inference Server, you need to create a Kubernetes deployment configuration. Here is an example triton_deployment.yaml file:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-deploy
  labels:
    app: triton-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: triton-app
  template:
    metadata:
      labels:
        app: triton-app
    spec:
      containers:
        - name: triton-container
          ports:
            - containerPort: 8000
              name: http-triton
            - containerPort: 8001
              name: grpc-triton
            - containerPort: 8002
              name: metrics-triton
          image: "shahizat005/tritonserver:24.08-igpu-s3"
          command: ["/bin/bash"]
          args: ["-c", "cp /var/run/secrets/kubernetes.io/serviceaccount/ca.crt /usr/local/share/ca-certificates && update-ca-certificates && /opt/tritonserver/bin/tritonserver --model-store=s3://192.168.0.9:9000/model-repo --strict-model-config=false"]
          env:
            - name: AWS_ACCESS_KEY_ID
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: AWS_ACCESS_KEY_ID
            - name: AWS_SECRET_ACCESS_KEY
              valueFrom:
                secretKeyRef:
                  name: aws-credentials
                  key: AWS_SECRET_ACCESS_KEY
          livenessProbe:
            failureThreshold: 60
            initialDelaySeconds: 600
            periodSeconds: 5
            httpGet:
              path: /v2/health/live
              port: http-triton
          readinessProbe:
            failureThreshold: 60
            initialDelaySeconds: 600
            periodSeconds: 5
            httpGet:
              path: /v2/health/ready
              port: http-triton
The replica count is set to 3. Depending on your requirements, you can change the GPU allocation and replica counts. The livenessProbe and readinessProbe ensure that the containers are alive and ready to serve incoming traffic.
Apply the deployment manifest:
kubectl apply -f triton_deployment.yaml
Check the status of the pods to ensure they are running correctly:
kubectl get pods
The expected sample output is as follows:
NAME READY STATUS RESTARTS AGE
triton-deploy-6845554ffd-2fjnh 0/1 Running 0 8m23s
triton-deploy-6845554ffd-97wr2 0/1 Running 0 8m23s
triton-deploy-6845554ffd-stgsc 0/1 Running 0 8m23s
The server logs will display the status of the loaded models and the start-up messages for the gRPC, HTTP, and Metrics services.
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
+----------------------+---------+--------+
The container exposes three ports:
- 8000: HTTP port for health checks and model management.
- 8001: GRPC port for model inference requests.
- 8002: Metrics port for monitoring server performance.
I0825 10:34:59.260117 1 grpc_server.cc:2463] "Started GRPCInferenceService at 0.0.0.0:8001"
I0825 10:34:59.260426 1 http_server.cc:4692] "Started HTTPService at 0.0.0.0:8000"
I0825 10:34:59.301991 1 http_server.cc:362] "Started Metrics Service at 0.0.0.0:8002"
Install MetalLB for Load Balancing
Let's install MetalLB and get it to assign IP addresses for the Triton services.
kubectl apply -f https://raw.githubusercontent.com/metallb/metallb/v0.14.8/config/manifests/metallb-native.yaml
Once that's run, we can confirm that MetalLB is running by checking the pods in the metallb-system namespace.
kubectl get pods -n metallb-system
This should return something similar to:
NAME READY STATUS RESTARTS AGE
controller-6dd967fdc7-jxs8l 1/1 Running 0 8m25s
speaker-2rdcj 1/1 Running 0 8m24s
Then define the address-pools and addresses values as desired and apply the manifest; note that in layer 2 mode MetalLB typically also needs an L2Advertisement resource referencing this pool before it announces the addresses.
apiVersion: metallb.io/v1beta1
kind: IPAddressPool
metadata:
  name: first-pool
  namespace: metallb-system
spec:
  addresses:
    - 192.168.0.10-192.168.0.20
Create a Kubernetes service for the Triton Inference Server:
apiVersion: v1
kind: Service
metadata:
  name: triton-service
  labels:
    app: triton-app
spec:
  selector:
    app: triton-app
  type: LoadBalancer
  ports:
    - name: http-triton
      port: 8000
      targetPort: 8000
      nodePort: 30800
      protocol: TCP
    - name: grpc-triton
      port: 8001
      targetPort: 8001
      nodePort: 30801
      protocol: TCP
    - name: metrics-triton
      port: 8002
      targetPort: 8002
      nodePort: 30802
      protocol: TCP
Apply this configuration:
kubectl apply -f triton_service.yaml
A Service of type LoadBalancer is created for the deployment. The Triton Inference Server can then be accessed using the LoadBalancer IP, which is on the application network.
Then check using the kubectl get svc command:
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
kubernetes ClusterIP 10.43.0.1 <none> 443/TCP
triton-service LoadBalancer 10.43.138.92 192.168.0.11 8000:30918/TCP,8001:31112/TCP,8002:30934/TCP 5h56m
If the load balancer is set up correctly, the Triton deployment should receive an external IP: 192.168.0.11.
First, you need to confirm that the Triton server has started normally and can be accessed remotely. Here are the steps to check the server's status and metrics.
To verify the health status of the Triton Inference Server, execute the following curl command:
curl -v 192.168.0.11:8000/v2/health/ready
The expected output should include the "HTTP/1.1 200 OK" status, indicating that the server is running correctly.
* Trying 192.168.0.11:8000...
* Connected to 192.168.0.11 (192.168.0.11) port 8000 (#0)
> GET /v2/health/ready HTTP/1.1
> Host: 192.168.0.11:8000
> User-Agent: curl/7.81.0
> Accept: */*
>
* Mark bundle as not supporting multiuse
< HTTP/1.1 200 OK
< Content-Length: 0
< Content-Type: text/plain
<
* Connection #0 to host 192.168.0.11 left intact
As long as the response contains "HTTP/1.1 200 OK", everything is working correctly.
To monitor the server's performance metrics, use the following command:
curl -v 192.168.0.11:8002/metrics
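The same checks can be scripted; a minimal sketch using the tritonclient Python package and plain HTTP against the external IP assigned above:
import requests
import tritonclient.http as httpclient

# Connect to Triton through the MetalLB-assigned external IP
client = httpclient.InferenceServerClient(url="192.168.0.11:8000")
print("live: ", client.is_server_live())
print("ready:", client.is_server_ready())
print("densenet_onnx ready:", client.is_model_ready("densenet_onnx"))

# The Prometheus metrics endpoint serves plain text on port 8002
metrics = requests.get("http://192.168.0.11:8002/metrics", timeout=5).text
print("\n".join(metrics.splitlines()[:5]))  # show the first few metric lines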
Efficient Utilization of GPU Memory
Triton Inference Server provides APIs to manage model loading and unloading, which helps in efficient utilization of GPU resources.
You can call the POST API to load a model:
curl -X POST http://192.168.0.11:8000/v2/repository/models/<Model Name>/load
You can call the POST API to unload a model:
curl -X POST http://192.168.0.11:8000/v2/repository/models/<Model Name>/unload
This allows multiple models to share the GPU, loading only what is needed, which can help optimize memory usage and improve performance.
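The same repository API is exposed through the Python client; a minimal sketch (note that explicit load/unload generally requires the server to run with --model-control-mode=explicit, an option not shown in the deployment above):
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

# Load a model into memory, check its state, then release it again
client.load_model("densenet_onnx")
print("loaded:", client.is_model_ready("densenet_onnx"))

client.unload_model("densenet_onnx")
print("still loaded:", client.is_model_ready("densenet_onnx"))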
Run Triton Inference Client
To communicate with the Triton server, you can use the provided C++ and Python client libraries.
Clone the Triton Client Repository:
git clone https://github.com/triton-inference-server/client.git
Create a requirements.txt file with the following dependencies:
pillow
numpy
attrdict
tritonclient
google-api-python-client
grpcio
geventhttpclient
boto3
Then, install these dependencies using:
pip3 install -r requirements.txt
Navigate to the client/src/python/examples directory and execute the client script. Here is an example using the image_client.py script:
python3 image_client.py \
-u 192.168.0.11:8000 \
-m inception_graphdef \
-s INCEPTION \
-x 1 \
-c 1 \
test.png
This command will send an inference request to the Triton server and display the results. The output should look something like this:
Request 1, batch size 1
0.430289 (505) = COFFEE MUG
PASS
Adding Custom YOLO Models to the Triton Inference Engine
It's highly recommended to use the same Docker container for both converting your custom model to the TensorRT engine format and deploying it on Triton. This helps avoid potential incompatibilities that might arise when using different environments.
Run the following command to start the Docker container:
sudo docker run --runtime nvidia -it --ipc=host tritonserver:24.08-igpu-s3
Setting up YOLOv8 for your computer vision projects is a straightforward process. It begins with creating a virtual environment to manage Python dependencies efficiently.
Install the Python virtual environment packages:
pip3 install virtualenv
apt install python3.10-venv
Create a virtual environment within the current directory:
python3 -m venv yolo
Activate the new virtual environment:
source yolo/bin/activate
Install the YOLO library (e.g., ultralytics) and other dependencies within the virtual environment:
pip install ultralytics
Download the pretrained YOLOv8 model:
wget https://github.com/ultralytics/assets/releases/download/v0.0.0/yolov8l.pt
Use YOLO's export function to convert the model into the Open Neural Network Exchange (ONNX) format, which is interoperable with various frameworks:
yolo export model=./yolov8l.pt imgsz=640 format=onnx opset=11
This command exports the yolov8l.pt model to the ONNX format with an image size of 640 pixels and opset level 11.
Once the export is complete, we can convert the model to TensorRT format. Use the trtexec tool from the TensorRT installation to convert the ONNX model to a TensorRT engine:
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8l.onnx \
--saveEngine=yolov8l.engine \
--streams=8
This command converts the yolov8l.onnx model to a TensorRT engine named yolov8l.engine using 8 streams (threads) for improved performance.
For faster inference, you can convert the model to use the FP16 (half-precision) data type. Use the following flags with trtexec:
/usr/src/tensorrt/bin/trtexec \
--onnx=yolov8l.onnx \
--saveEngine=yolov8l.engine \
--fp16 \
--inputIOFormats=fp16:chw \
--outputIOFormats=fp16:chw \
--streams=8
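As an alternative to calling trtexec directly, Ultralytics can export straight to a TensorRT engine from Python; a rough sketch of that alternative path (it requires a GPU in the container and is not the exact command used above):
from ultralytics import YOLO

# Load the pretrained checkpoint and export it directly to a TensorRT engine.
# half=True produces an FP16 engine, comparable to the --fp16 trtexec flags above.
model = YOLO("yolov8l.pt")
engine_path = model.export(format="engine", imgsz=640, half=True, device=0)
print("engine written to:", engine_path)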
To be able to use TYPE_FP16 as the data type for the input and output, use --inputIOFormats and --outputIOFormats when parsing the model from ONNX.
To inspect a generated engine with the trtexec command, you can use the --loadEngine flag:
trtexec --loadEngine=model.plan --verbose
Create a directory structure to organize your models and configuration files:
./upload/
├── yolov8l_fp16
│ ├── 1
│ │ └── yolov8l.engine
│ └── config.pbtxt
└── yolov8l_fp32
├── 1
│ └── yolov8l.engine
└── config.pbtxt
4 directories, 4 files
Under each model folder, we have a config.pbtxt (protobuf text) file which describes the model configuration:
name: "yolov8l_fp32"
platform: "tensorrt_plan"
max_batch_size : 0
input [
{
name: "images"
data_type: TYPE_FP32
dims: [ -1, 3, 640, 640 ]
}
]
output [
{
name: "output0"
data_type: TYPE_FP32
dims: [ -1, 84, 8400 ]
}
]
Upload the Model to MinIO S3 Model Repository
mc cp --recursive ./upload/ myminio/model-repo/
Then redeploy Triton:
kubectl apply -f triton_deployment.yaml
sudo kubectl get pods
You will see the following output:
NAME READY STATUS RESTARTS AGE
triton-deploy-6845554ffd-5rbbj 0/1 Running 0 6s
triton-deploy-6845554ffd-6zjwg 0/1 Running 0 6s
triton-deploy-6845554ffd-kthrg 0/1 Running 0 6s
Check the logs using kubectl logs to verify that the model was loaded successfully:
+----------------------+---------+--------+
| Model                | Version | Status |
+----------------------+---------+--------+
| densenet_onnx        | 1       | READY  |
| inception_graphdef   | 1       | READY  |
| simple               | 1       | READY  |
| simple_dyna_sequence | 1       | READY  |
| simple_identity      | 1       | READY  |
| simple_int8          | 1       | READY  |
| simple_sequence      | 1       | READY  |
| simple_string        | 1       | READY  |
| yolov8l_fp16         | 1       | READY  |
| yolov8l_fp32         | 1       | READY  |
+----------------------+---------+--------+
"Started GRPCInferenceService at 0.0.0.0:8001"
"Started HTTPService at 0.0.0.0:8000"
"Started Metrics Service at 0.0.0.0:8002"
All models were loaded successfully. Once deployed, you can run inference against the LoadBalancer IP address using the following Python code:
from ultralytics.utils import ROOT, yaml_load
from ultralytics.utils.checks import check_yaml
import time
import cv2
import numpy as np
from PIL import Image
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

CLASSES = yaml_load(check_yaml('coco128.yaml'))['names']
colors = np.random.uniform(0, 255, size=(len(CLASSES), 3))

def draw_bounding_box(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    label = f'{CLASSES[class_id]} ({confidence:.2f})'
    color = colors[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x - 10, y - 10), cv2.FONT_HERSHEY_DUPLEX, 0.7, color, 1)

def main():
    # Pre Processing
    original_image = cv2.imread("./test.jpg")
    or_copy = original_image.copy()
    [height, width, _] = original_image.shape
    length = max((height, width))
    scale = length / 640
    image = np.zeros((length, length, 3), np.uint8)
    image[0:height, 0:width] = original_image
    resize = cv2.resize(image, (640, 640))
    img = resize[np.newaxis, :, :, :] / 255.0
    img = img.transpose((0, 3, 1, 2)).astype(np.float32)

    inputs = httpclient.InferInput("images", img.shape, datatype="FP32")
    inputs.set_data_from_numpy(img, binary_data=True)
    outputs = httpclient.InferRequestedOutput("output0", binary_data=True)

    # Inference
    start_time = time.time()
    res = client.infer(model_name="yolov8l_fp32", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
    end_time = time.time()
    inf_time = (end_time - start_time)
    print(f"inference time: {inf_time*1000:.3f} ms")

    # Post Processing
    outputs = np.array([cv2.transpose(res[0].astype(np.float32))])
    rows = outputs.shape[1]
    boxes = []
    scores = []
    class_ids = []
    for i in range(rows):
        classes_scores = outputs[0][i][4:]
        (minScore, maxScore, minClassLoc, (x, maxClassIndex)) = cv2.minMaxLoc(classes_scores)
        if maxScore >= 0.25:
            box = [
                outputs[0][i][0] - (0.5 * outputs[0][i][2]), outputs[0][i][1] - (0.5 * outputs[0][i][3]),
                outputs[0][i][2], outputs[0][i][3]]
            boxes.append(box)
            scores.append(maxScore)
            class_ids.append(maxClassIndex)

    result_boxes = cv2.dnn.NMSBoxes(boxes, scores, 0.25, 0.45, 0.5)
    detections = []
    for i in range(len(result_boxes)):
        index = result_boxes[i]
        box = boxes[index]
        detection = {
            'class_id': class_ids[index],
            'class_name': CLASSES[class_ids[index]],
            'confidence': scores[index],
            'box': box,
            'scale': scale}
        detections.append(detection)
        draw_bounding_box(or_copy, class_ids[index], scores[index], round(box[0] * scale), round(box[1] * scale),
                          round((box[0] + box[2]) * scale), round((box[1] + box[3]) * scale))

    # Save the output image to a file
    cv2.imwrite("detection_result.jpg", or_copy)
    print("Result saved as detection_result.jpg")
    return or_copy

if __name__ == "__main__":
    result_image = main()
Inference time for FP32 based precision model:
inference time: 210.176 ms
To test FP16, modify the code to use FP16 data types and run inference again:
img = img.transpose((0, 3, 1, 2)).astype(np.float16)
inputs = httpclient.InferInput("images", img.shape, datatype="FP16")
res = client.infer(model_name="yolov8l_fp16", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
Inference time for FP16 model:
inference time: 131.801 ms
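To confirm the declared input and output datatypes of each deployed variant before switching precision, you can query the model metadata; a minimal sketch:
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.11:8000")

# Print the declared I/O names, datatypes, and shapes for both precision variants
for model_name in ("yolov8l_fp32", "yolov8l_fp16"):
    meta = client.get_model_metadata(model_name)
    for tensor in meta["inputs"] + meta["outputs"]:
        print(model_name, tensor["name"], tensor["datatype"], tensor["shape"])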
Here are two example use cases that demonstrate the model's object detection capabilities. The first example uses an image of a black and white ceramic tea cup with a saucer on a black wooden table.
The model successfully detects the tea cup and the dining table in the image, as evident from the detection result with the bounding box.
The second example uses an image of a light workplace with a laptop.
These examples were used to demonstrate the YOLO model's ability to detect objects.
For video inference, you can use the following code snippet:
from ultralytics.utils import ROOT, yaml_load
from ultralytics.utils.checks import check_yaml
import time
import cv2
import numpy as np
from PIL import Image
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="192.168.0.10:8000")

CLASSES = yaml_load(check_yaml('coco128.yaml'))['names']
colors = np.random.uniform(0, 255, size=(len(CLASSES), 3))

def draw_bounding_box(img, class_id, confidence, x, y, x_plus_w, y_plus_h):
    label = f'{CLASSES[class_id]} ({confidence:.2f})'
    color = colors[class_id]
    cv2.rectangle(img, (x, y), (x_plus_w, y_plus_h), color, 2)
    cv2.putText(img, label, (x, y - 10), cv2.FONT_HERSHEY_SIMPLEX, 0.5, color, 2)

def process_frame(frame):
    # Pre-process each frame
    original_image = frame
    or_copy = original_image.copy()
    height, width, _ = original_image.shape
    length = max(height, width)
    scale = length / 640
    image = np.zeros((length, length, 3), np.uint8)
    image[0:height, 0:width] = original_image
    resize = cv2.resize(image, (640, 640))
    img = resize[np.newaxis, :, :, :] / 255.0
    img = img.transpose((0, 3, 1, 2)).astype(np.float32)

    # Inference
    inputs = httpclient.InferInput("images", img.shape, datatype="FP32")
    inputs.set_data_from_numpy(img, binary_data=True)
    outputs = httpclient.InferRequestedOutput("output0", binary_data=True)
    start_time = time.time()
    res = client.infer(model_name="yolov8l_fp32", inputs=[inputs], outputs=[outputs]).as_numpy('output0')
    end_time = time.time()
    inf_time = (end_time - start_time)
    print(f"inference time: {inf_time*1000:.3f} ms")

    # Post-process
    outputs = np.array([cv2.transpose(res[0].astype(np.float32))])
    rows = outputs.shape[1]
    boxes = []
    scores = []
    class_ids = []
    for i in range(rows):
        classes_scores = outputs[0][i][4:]
        (minScore, maxScore, minClassLoc, (x, maxClassIndex)) = cv2.minMaxLoc(classes_scores)
        if maxScore >= 0.25:
            box = [
                outputs[0][i][0] - (0.5 * outputs[0][i][2]), outputs[0][i][1] - (0.5 * outputs[0][i][3]),
                outputs[0][i][2], outputs[0][i][3]]
            boxes.append(box)
            scores.append(maxScore)
            class_ids.append(maxClassIndex)

    result_boxes = cv2.dnn.NMSBoxes(boxes, scores, 0.25, 0.45, 0.5)
    for i in range(len(result_boxes)):
        index = result_boxes[i]
        box = boxes[index]
        draw_bounding_box(
            or_copy,
            class_ids[index],
            scores[index],
            round(box[0] * scale),
            round(box[1] * scale),
            round((box[0] + box[2]) * scale),
            round((box[1] + box[3]) * scale)
        )
    return or_copy

def main():
    input_video = "video1.mp4"  # Input video file path
    output_video = "detection_result.mp4"  # Output video file path
    cap = cv2.VideoCapture(input_video)
    width = int(cap.get(cv2.CAP_PROP_FRAME_WIDTH))
    height = int(cap.get(cv2.CAP_PROP_FRAME_HEIGHT))
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Define the codec and create VideoWriter object
    fourcc = cv2.VideoWriter_fourcc(*'mp4v')
    out = cv2.VideoWriter(output_video, fourcc, fps, (width, height))

    while cap.isOpened():
        ret, frame = cap.read()
        if not ret:
            break
        # Process each frame and write to output video
        processed_frame = process_frame(frame)
        out.write(processed_frame)

    # Release everything
    cap.release()
    out.release()
    print(f"Detection result saved as {output_video}")

if __name__ == "__main__":
    main()
Here's an example video demonstrating object detection with multiple objects using the Ultralytics YOLOv8 model.
This implementation serves as a starting point for developers looking to integrate different models into their video analytics applications on the edge. To conclude, I explored how to set up and run NVIDIA Triton Inference Server using K3s and MinIO on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you liked this post, please support me by subscribing to my blog.
Thanks and acknowledgements
- Video footage courtesy of Pexels.com (Mixkit Free Video Assets)
- Photos courtesy of Pexels.com (Nao Triponez and Cup of Couple)
- Thanks to Andreas Schliebitz for his help with the Triton Docker image.