In this tutorial, I'll guide you through deploying a Large Language Model (LLM) inference engine on the NVIDIA Jetson AGX Orin Developer Kit using MicroK8s. MicroK8s offers a simplified approach to deploying and managing Kubernetes clusters, making it ideal for a wide range of applications, including those involving LLMs.
To ensure optimal performance and fault tolerance, I've configured a load balancer to distribute incoming traffic across multiple replica pods. This approach enhances the system's capacity to handle concurrent API requests, preventing bottlenecks and ensuring a seamless user experience.
The following diagram shows the target infrastructure that you will obtain after following this blog post.
The load balancer distributes incoming requests from external clients across multiple MLC-LLM serving pod replicas deployed on the NVIDIA Jetson AGX Orin Developer Kit. Each serving replica is configured to load the model. The number of replicas can range from 1 to N, depending on factors such as model size, precision, and GPU memory consumption. For this tutorial, I will be using three replica pods.
Install a single-node Kubernetes cluster using MicroK8s
MicroK8s is the easiest and fastest way to get Kubernetes up and running. Developed by Canonical, the creators of Ubuntu, it is a compact and efficient Kubernetes distribution designed to streamline the deployment and management of containerized applications on a single node. MicroK8s also serves as an excellent learning tool for those diving into the world of Kubernetes.
For the LLM deployment, we first need to set up the Kubernetes cluster and ensure it is running smoothly. The installation of MicroK8s is quite simple.
First, install MicroK8s using the snap package manager. Run the following command in your terminal:
sudo snap install microk8s --classic --channel=latest/stable
You should see the following output confirming the installation:
microk8s v1.31.0 from Canonical✓ installed
To manage MicroK8s without needing to use sudo, add your current user to the microk8s group and adjust the ownership of the .kube directory:
sudo usermod -a -G microk8s $USER
sudo chown -f -R $USER ~/.kube
To simplify command usage, create an alias for kubectl so you can omit the microk8s prefix:
alias kubectl='microk8s kubectl'
echo "alias kubectl='microk8s kubectl'" >> ~/.bash_aliases
Open the MicroK8s add-ons configuration file to add ARM64 support for the GPU add-on:
sudo nano /var/snap/microk8s/common/addons/core/addons.yaml
Find the gpu add-on entry and update it so that arm64 is listed under supported_architectures:
- name: "gpu"
  description: "Automatic enablement of Nvidia CUDA"
  version: "1.11.0"
  check_status: "daemonset.apps/nvidia-device-plugin-daemonset"
  supported_architectures:
    - amd64
    - arm64
Now, enable the GPU add-on:
sudo microk8s enable gpu
You should see output indicating that the GPU and related services are being enabled:
Infer repository core for addon gpu
Enabling NVIDIA GPU
Enabling DNS
Applying manifest
serviceaccount/coredns created
configmap/coredns created
deployment.apps/coredns created
service/kube-dns created
clusterrole.rbac.authorization.k8s.io/coredns created
clusterrolebinding.rbac.authorization.k8s.io/coredns created
Restarting kubelet
DNS is enabled
Enabling Helm 3
Fetching helm version v3.8.0.
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 11.7M 100 11.7M 0 0 2289k 0 0:00:05 0:00:05 --:--:-- 3212k
Helm 3 is enabled
Checking if NVIDIA driver is already installed
GPU 0: Orin (nvgpu) (UUID: 36baf986-26a8-5222-9d8b-823b8dff81c6)
Using host GPU driver
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/5874/credentials/client.config
"nvidia" has been added to your repositories
WARNING: Kubernetes configuration file is group-readable. This is insecure. Location: /var/snap/microk8s/5874/credentials/client.config
NAME: gpu-operator
LAST DEPLOYED: Tue Aug 13 21:55:41 2024
NAMESPACE: gpu-operator-resources
STATUS: deployed
REVISION: 1
TEST SUITE: None
NVIDIA is enabled
Next, modify the containerd configuration template to ensure it uses the NVIDIA runtime. Open the file:
sudo nano /var/snap/microk8s/current/args/containerd-template.toml
Replace the contents with the following configuration:
# Use config version 2 to enable new configuration fields.
version = 2
oom_score = 0

[grpc]
  uid = 0
  gid = 0
  max_recv_message_size = 16777216
  max_send_message_size = 16777216

[debug]
  address = ""
  uid = 0
  gid = 0

[metrics]
  address = "127.0.0.1:1338"
  grpc_histogram = false

[cgroup]
  path = ""

# The 'plugins."io.containerd.grpc.v1.cri"' table contains all of the server options.
[plugins."io.containerd.grpc.v1.cri"]
  stream_server_address = "127.0.0.1"
  stream_server_port = "0"
  enable_selinux = false
  sandbox_image = "registry.k8s.io/pause:3.7"
  stats_collect_period = 10
  enable_tls_streaming = false
  max_container_log_line_size = 16384

  # 'plugins."io.containerd.grpc.v1.cri".containerd' contains config related to containerd
  [plugins."io.containerd.grpc.v1.cri".containerd]
    snapshotter = "${SNAPSHOTTER}"
    no_pivot = false
    default_runtime_name = "nvidia" # Set to nvidia by default

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "${RUNTIME_TYPE}"

    # Consolidated NVIDIA runtime configuration
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
      runtime_type = "io.containerd.runc.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
        BinaryName = "/usr/bin/nvidia-container-runtime"

    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata]
      runtime_type = "io.containerd.kata.v2"

      [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.kata.options]
        BinaryName = "kata-runtime"
Now, enable additional services, including DNS, hostpath-storage, ingress, and MetalLB:
sudo microk8s enable dns hostpath-storage ingress metallb:192.168.0.10-192.168.0.16
MetalLB is a load-balancer implementation for bare-metal Kubernetes clusters that uses standard routing protocols. When enabling the add-on, you specify the IP address range to allocate to load balancers (here 192.168.0.10-192.168.0.16); it is good practice to assign a small pool. In Layer 2 mode, MetalLB announces each service IP on the local network from a single node, and incoming connections are then spread across the endpoints of the service.
You should see output similar to the following as these services are enabled:
Infer repository core for addon dns
Infer repository core for addon hostpath-storage
Infer repository core for addon ingress
Infer repository core for addon metallb
Addon core/dns is already enabled
Enabling default storage class.
WARNING: Hostpath storage is not suitable for production environments.
deployment.apps/hostpath-provisioner created
storageclass.storage.k8s.io/microk8s-hostpath created
serviceaccount/microk8s-hostpath created
clusterrole.rbac.authorization.k8s.io/microk8s-hostpath created
clusterrolebinding.rbac.authorization.k8s.io/microk8s-hostpath created
Storage will be available soon.
Enabling Ingress
ingressclass.networking.k8s.io/public created
namespace/ingress created
serviceaccount/nginx-ingress-microk8s-serviceaccount created
clusterrole.rbac.authorization.k8s.io/nginx-ingress-microk8s-clusterrole created
role.rbac.authorization.k8s.io/nginx-ingress-microk8s-role created
clusterrolebinding.rbac.authorization.k8s.io/nginx-ingress-microk8s created
rolebinding.rbac.authorization.k8s.io/nginx-ingress-microk8s created
configmap/nginx-load-balancer-microk8s-conf created
configmap/nginx-ingress-tcp-microk8s-conf created
configmap/nginx-ingress-udp-microk8s-conf created
daemonset.apps/nginx-ingress-microk8s-controller created
Ingress is enabled
Enabling MetalLB
Applying Metallb manifest
namespace/metallb-system created
secret/memberlist created
Warning: policy/v1beta1 PodSecurityPolicy is deprecated in v1.21+, unavailable in v1.25+
podsecuritypolicy.policy/controller created
podsecuritypolicy.policy/speaker created
serviceaccount/controller created
serviceaccount/speaker created
clusterrole.rbac.authorization.k8s.io/metallb-system:controller created
clusterrole.rbac.authorization.k8s.io/metallb-system:speaker created
role.rbac.authorization.k8s.io/config-watcher created
role.rbac.authorization.k8s.io/pod-lister created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:controller created
clusterrolebinding.rbac.authorization.k8s.io/metallb-system:speaker created
rolebinding.rbac.authorization.k8s.io/config-watcher created
rolebinding.rbac.authorization.k8s.io/pod-lister created
Warning: spec.template.spec.nodeSelector[beta.kubernetes.io/os]: deprecated since v1.14; use "kubernetes.io/os" instead
daemonset.apps/speaker created
deployment.apps/controller created
configmap/config created
MetalLB is enabled
For all changes to take effect, restart MicroK8s:
sudo microk8s.stop
sudo microk8s.start
Check the status of MicroK8s to confirm that it is running correctly:
sudo microk8s status
You should see output indicating that MicroK8s is running with the enabled add-ons:
microk8s is running
high-availability: no
datastore master nodes: 127.0.0.1:19001
datastore standby nodes: none
addons:
enabled:
dns # (core) CoreDNS
ha-cluster # (core) Configure high availability on the current node
helm3 # (core) Helm 3 - Kubernetes package manager
hostpath-storage # (core) Storage class; allocates storage from host directory
ingress # (core) Ingress controller for external access
metallb # (core) Loadbalancer for your Kubernetes cluster
storage # (core) Alias to hostpath-storage add-on, deprecated
disabled:
community # (core) The community addons repository
dashboard # (core) The Kubernetes dashboard
gpu # (core) Automatic enablement of Nvidia CUDA
helm # (core) Helm 2 - the package manager for Kubernetes
host-access # (core) Allow Pods connecting to Host services smoothly
mayastor # (core) OpenEBS MayaStor
metrics-server # (core) K8s Metrics Server for API access to service metrics
prometheus # (core) Prometheus operator for monitoring and logging
rbac # (core) Role-Based Access Control for authorisation
registry # (core) Private image registry exposed on localhost:32000
Create a simple test application to verify that everything is functioning properly. Use the following YAML configuration:
apiVersion: v1
kind: Pod
metadata:
  name: torch
spec:
  imagePullSecrets:
    - name: my-image-pull-secret
  containers:
    - name: torchtest
      image: dustynv/l4t-pytorch:r36.2.0
      securityContext:
        privileged: true
      command: [ "/bin/bash", "-c", "--" ]
      args: [ "while true; do sleep 30; done;" ]
When a pod is created, Kubernetes attempts to pull the container image specified in the pod definition from a container registry. In this case, we will be using pre-built container images from Dustin Franklin's public Docker container registry.
Save this manifest to a file (for example, torch-pod.yaml), apply it with kubectl apply -f torch-pod.yaml, and then check the status of the pod to ensure it is running:
kubectl get pods
NAME READY STATUS RESTARTS AGE
torch 1/1 Running 0 111s
Once the pod is in a running state, you can access the Python terminal within the pod:
kubectl exec -it torch -- python3
Then, run the following commands to verify the GPU is recognized:
>>> import torch
>>> torch.cuda.get_device_name(0)
'Orin'
>>>
The 'Orin' device name in the output confirms that the GPU is recognized inside the pod.
If you encounter the following issue while connecting to the server:
Unable to connect to the server: tls: failed to verify certificate: x509: certificate has expired or is not yet valid:
You can resolve this by refreshing the certificates:
sudo microk8s refresh-certs --cert ca.crt
We have successfully established a robust single-node Kubernetes cluster using MicroK8s on the NVIDIA Jetson AGX Orin Developer Kit.
Deploying an LLM Inference Engine with MLC-LLM
There are many popular LLM inference servers, including TGI, vLLM, and NVIDIA Triton. Here I will be using MLC-LLM, an open-source framework for serving popular LLMs that is available for the NVIDIA Jetson AGX Orin Developer Kit as the pre-built container dustynv/mlc:0.1.1-r36.3.0 (thanks to Dustin Franklin from NVIDIA). MLC-LLM offers an API compatible with the OpenAI Chat Completion API. Llama 3 is the current state of the art (SOTA) among open-source LLMs; its largest model, with 70 billion parameters, is comparable to GPT-3.5 on a number of tasks.
The GPU memory requirement is largely determined by the size of the pre-trained LLM. For example, Llama 3 8B (8 billion parameters) loaded in 16-bit precision requires 8 billion * 2 bytes = 16 GB for the model weights, plus additional overhead. Quantization is a technique that reduces model size and improves inference performance by lowering numerical precision without significantly sacrificing accuracy. In this example, we use quantization to load a model based on Llama 3 8B in 4-bit precision (q4f16_1) and fit it on the NVIDIA Jetson AGX Orin Developer Kit with 64 GB of unified memory. We could also use the unquantized q0f16 variant, a float16 model without quantization.
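To get a feel for these numbers, here is a minimal back-of-the-envelope sketch in Python (the parameter count and precisions are illustrative; real usage also includes the KV cache and temporary buffers, as the serving logs below show):
# Rough estimate of the GPU memory needed for model weights at different precisions.
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

llama3_8b = 8e9  # roughly 8 billion parameters

for label, bits in [("float16 (q0f16)", 16), ("4-bit (q4f16_1)", 4)]:
    print(f"{label}: ~{weight_memory_gb(llama3_8b, bits):.1f} GB for weights")

# Expected output (approximately):
# float16 (q0f16): ~16.0 GB for weights
# 4-bit (q4f16_1): ~4.0 GB for weights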
Running MLC-LLM on Kubernetes is pretty straightforward. Create a manifest named mlc-llm-deploy.yaml with the following content:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mlc-llm-deploy
  labels:
    app: mlc-llm-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: mlc-llm-app
  template:
    metadata:
      labels:
        app: mlc-llm-app
    spec:
      containers:
        - name: mlc-llm-container
          image: dustynv/mlc:0.1.1-r36.3.0
          command: ["/bin/bash", "-c", "mlc_llm serve --host 0.0.0.0 --model-lib-path /model/model_lib/4b70b61200d9fc97774a43203a3ac054.so /model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC"]
          ports:
            - containerPort: 8000
          volumeMounts:
            - name: model-storage
              mountPath: /model
      volumes:
        - name: model-storage
          hostPath:
            path: /home/jetson/Projects/inference_engine
            type: Directory
The deployment manifest is responsible for deploying the MLC-LLM-based model, which exposes APIs as shown in the diagram above.
To deploy the manifest, run the following command:
kubectl apply -f mlc-llm-deploy.yaml
This configuration instructs Kubernetes to create three replicated pods for the mlc-llm-app. Each pod runs a container from the dustynv/mlc:0.1.1-r36.3.0 container image with the container's port 8000 exposed.
Run the following command to check the pods:
kubectl get pods
This should output something like:
NAME                              READY   STATUS    RESTARTS   AGE
mlc-llm-deploy-6466fb97b5-dlksm   1/1     Running   0          66m
mlc-llm-deploy-6466fb97b5-mwslm   1/1     Running   0          66m
mlc-llm-deploy-6466fb97b5-w92w9   1/1     Running   0          66m
To check the status of the inference service, view the logs of one of the pods, for example with kubectl logs mlc-llm-deploy-6466fb97b5-dlksm. You should see something like:
[2024-10-06 16:20:29] INFO auto_device.py:76: Found device: cuda:0
[2024-10-06 16:20:31] INFO auto_device.py:85: Not found device: rocm:0
[2024-10-06 16:20:34] INFO auto_device.py:85: Not found device: metal:0
[2024-10-06 16:20:36] INFO auto_device.py:85: Not found device: vulkan:0
[2024-10-06 16:20:38] INFO auto_device.py:85: Not found device: opencl:0
[2024-10-06 16:20:38] INFO auto_device.py:33: Using device: cuda:0
[2024-10-06 16:20:38] INFO chat_module.py:379: Using model folder: /model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC
[2024-10-06 16:20:38] INFO chat_module.py:380: Using mlc chat config: /model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC/mlc-chat-config.json
[2024-10-06 16:20:38] INFO chat_module.py:529: Using library model: /model/model_lib/4b70b61200d9fc97774a43203a3ac054.so
[2024-10-06 16:20:43] INFO engine_base.py:395: Under mode "local", max batch size is set to 4, max KV cache token capacity is set to 8192, prefill chunk size is set to 2048. We choose small max batch size and KV cache capacity to use less GPU memory.
[2024-10-06 16:20:48] INFO engine_base.py:395: Under mode "interactive", max batch size is set to 1, max KV cache token capacity is set to 8192, prefill chunk size is set to 2048. We fix max batch size to 1 for interactive single sequence use.
[2024-10-06 16:20:53] INFO engine_base.py:395: Under mode "server", max batch size is set to 80, max KV cache token capacity is set to 413260, prefill chunk size is set to 2048. We use as much GPU memory as possible (within the limit of gpu_memory_utilization).
[2024-10-06 16:20:53] INFO engine_base.py:432: The actual engine mode is "local". So max batch size is 4, max KV cache token capacity is 8192, prefill chunk size is 2048.
[2024-10-06 16:20:53] INFO engine_base.py:441: Estimated total single GPU memory usage: 5772.59 MB (Parameters: 4308.13 MB. KVCache: 1112.53 MB. Temporary buffer: 351.93 MB). The actual usage might be slightly larger than the estimated number.
[2024-10-06 16:20:53] INFO engine_base.py:450: Please switch to mode "server" if you want to use more GPU memory and support more concurrent requests. Please override the arguments if you have particular values to set.
INFO: Started server process [1]
INFO: Waiting for application startup.
INFO: Application startup complete.
INFO: Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
INFO: 192.168.0.9:58258 - "POST /v1/chat/completions HTTP/1.1" 200 OK
INFO: 192.168.0.9:33643 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Each MLC-LLM pod uses approximately 5772.59 MB of GPU memory, most of which is allocated to the model parameters. The server runs in "local" mode, which limits the batch size and GPU memory usage; if you want to utilize more GPU memory and support more concurrent requests, you can switch to "server" mode.
Each pod has now loaded the model and is ready to receive inference calls. To expose the application outside the cluster, we use a LoadBalancer service: the load balancer provides a single IP address that routes incoming requests to the MLC-LLM pods.
We will define a service to expose our LLM inference engine to the network. Create a file named mlc-llm-service.yaml with the following content:
apiVersion: v1
kind: Service
metadata:
  name: mlc-llm-service
  labels:
    app: mlc-llm-app
spec:
  selector:
    app: mlc-llm-app
  ports:
    - protocol: TCP
      port: 8000
      targetPort: 8000
  type: LoadBalancer
With both configuration files created, you can now deploy the service using kubectl. Run the following command:
kubectl apply -f mlc-llm-service.yaml
Then run the following command to see your LoadBalancer service with its external IP and port:
kubectl get svc
NAME              TYPE           CLUSTER-IP       EXTERNAL-IP    PORT(S)          AGE
kubernetes        ClusterIP      10.152.183.1     <none>         443/TCP          36d
mlc-llm-service   LoadBalancer   10.152.183.143   192.168.0.10   8000:32061/TCP   13d
Once we have an external IP, we can interact with the LLM inference engine through its API.
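Before sending chat requests, you can quickly verify that the endpoint is reachable, for example by listing the served models. Here is a minimal sketch, assuming the server exposes the OpenAI-compatible /v1/models route and that MetalLB assigned the external IP 192.168.0.10 shown above:
import requests

# Query the OpenAI-compatible models endpoint exposed by the mlc-llm-service.
resp = requests.get("http://192.168.0.10:8000/v1/models", timeout=10)
resp.raise_for_status()

# Print the identifier of each served model.
for model in resp.json().get("data", []):
    print(model.get("id"))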
For example, I will use curl from my MacBook Pro to send REST API requests over the local area network:
curl -X POST \
  -H "Content-Type: application/json" \
  -d '{
        "model": "/model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [
          {"role": "user", "content": "Hello! Our project is MLC LLM. What is the name of our project?"}
        ]
      }' \
  http://192.168.0.10:8000/v1/chat/completions
This should output something like:
{"id":"chatcmpl-24fedf44dda44c10a51aebea57cdd56e","choices":[{"finish_reason":"stop","index":0,"message":{"content":"Based on what you've shared, I believe our project is the Multilingual Language Model (MLC LLM) - that's correct, isn't it?","role":"assistant","name":null,"tool_calls":null,"tool_call_id":null},"logprobs":null}],"created":1728147061,"model":"/model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC","system_fingerprint":"","object":"chat.completion","usage":{"prompt_tokens":45,"completion_tokens":32,"total_tokens":77}}
Here is an example of how to send a request to the model using Python with streaming support:
import requests
import json

# Get a response using a prompt with streaming
payload = {
    "model": "/model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
    "messages": [{"role": "user", "content": "Who was Albert Einstein?"}],
    "stream": True,
}

with requests.post("http://192.168.0.10:8000/v1/chat/completions", json=payload, stream=True) as r:
    for chunk in r.iter_content(chunk_size=None):
        # Each chunk is a server-sent event of the form "data: {...}"
        chunk = chunk.decode("utf-8")
        if "[DONE]" in chunk[6:]:
            break
        # Strip the "data: " prefix before parsing the JSON payload
        response = json.loads(chunk[6:])
        content = response["choices"][0]["delta"].get("content", "")
        print(content, end="", flush=True)
print("\n")
You should obtain something like this:
Albert Einstein was a renowned German-born physicist who is widely regarded as one of the most influential scientists of the 20th century. He is best known for his theory of relativity and the famous equation E=mc², which revolutionized our understanding of space, time, and energy.
Born on March 14, 1879, in Munich, Germany, Einstein grew up in a middle-class family and was an average student until he was about 12 years old. At that age, he began to excel in his studies, particularly in mathematics and physics. He went on to study physics at the Swiss Federal Polytechnic School, where he graduated in 1900.
After completing his education, Einstein worked as a patent clerk in Bern, Switzerland, which gave him the opportunity to focus on his own research. In 1905, he published several groundbreaking papers, including the theory of Brownian motion, which provided strong evidence for the existence of atoms and molecules. This work earned him the Nobel Prize in Physics in 1921.
In 1915, Einstein developed his theory of general relativity, which posits that gravity is not a force that acts between objects, but rather a curvature of spacetime caused by the presence of massive objects. This theory predicted phenomena such as black holes, gravitational waves, and the bending of light around massive objects.
Einstein's other notable contributions include:
1. The photoelectric effect: He explained the behavior of light in the context of quantum mechanics, which led to his being awarded the Nobel Prize in Physics.
2. The special theory of relativity: He showed that the laws of physics are the same for all observers in uniform motion relative to one another.
3. The unified field theory: He attempted to unify the principles of electromagnetism, gravity, and the strong and weak nuclear forces into a single cohesive framework.
Einstein's work has had a profound impact on various fields, including physics, mathematics, and philosophy. He is also known for his witty sense of humor, his love of sailing, and his commitment to social and political activism. He passed away on April 18, 1955, but his legacy continues to inspire and influence people around the world.
To deploy the Kubernetes dashboard and access it via a web browser, run the following command:
microk8s dashboard-proxy
This will output something like:
Checking if Dashboard is running.
Infer repository core for addon dashboard
Infer repository core for addon metrics-server
Waiting for Dashboard to come up.
Trying to get token from microk8s-dashboard-token
Waiting for secret token (attempt 0)
Dashboard will be available at https://127.0.0.1:10443
Use the following token to login:
eyJhbGciOiJSUzI1NiIsImtpZCI6IjZDQTY5bENoUFcxWUlZRWlrMXhhQlJSbDlmVHVISVdPaFY2dVRvM1E0QlEifQ.eyJpc3MiOiJrdWJlcm5ldGVzL3NlcnZpY2VhY2NvdW50Iiwia3ViZXJuZXRlcy5pby9zZXJ2aWNlYWNjb3VudC9uYW1lc3BhY2UiOiJrdWJlLXN5c3RlbSIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VjcmV0Lm5hbWUiOiJtaWNyb2s4cy1kYXNoYm9hcmQtdG9rZW4iLCJrdWJlcm5ldGVzLmlvL3NlcnZpY2VhY2NvdW50L3NlcnZpY2UtYWNjb3VudC5uYW1lIjoiZGVmYXVsdCIsImt1YmVybmV0ZXMuaW8vc2VydmljZWFjY291bnQvc2VydmljZS1hY2NvdW50LnVpZCI6ImM5ODVmZGNhLTY5ZjYtNDA0ZS1hYTg3LTAwNzc5NmU1NDczMyIsInN1YiI6InN5c3RlbTpzZXJ2aWNlYWNjb3VudDprdWJlLXN5c3RlbTpkZWZhdWx0In0.YSBZdFj6cFPXZoSAOuz38nV6esw1ZGKRTAykk1eoXz8ORsRa0E9T2KlKU7z6r_v9Q-CO4rDfgQsqbDrj5tRU_pKpkUCpX39WN0RfcwL1oi7LnMlzvG68s9aJ5H84jHu5GYvoDgp_qFZKZiZKbEp2Ct1RM9kUdtzKui6PhbZQ27mCE8ikgd6DYRtDhhIWFwWJwn33gR4wJssjJ_MrjrCnyJP4IAr2m2-RJ4kgT-2aw5BVYlY5f6d7m2Z_O_iYzjd35YbWDAU5a8TqmL1cAuPsUWosLN4AojYEmBdxdd4YQt7X0wCMlT9s4SnlD-9xcEJRE9KZLyAesXq1Ih8B1Udf8w
The command prints a long token string. Copy it, open a web browser at https://127.0.0.1:10443, paste the token, and click SIGN IN; you will find yourself on the Kubernetes Dashboard.
You have successfully deployed an LLM inference engine using MicroK8s and the MLC-LLM framework on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
Let us see it in action with multiple prompts sequentially:
import requests
import json
from datasets import load_dataset

# Load the MMLU Sociology dataset
dataset = load_dataset("cais/mmlu", "sociology")

# Iterate over the dataset and run inference
for item in dataset['test']:  # You can choose 'train' or 'validation' if needed
    prompt = item['question']  # Assuming 'question' contains the prompt
    print(f"Running inference for prompt: {prompt}")

    # Prepare the payload
    payload = {
        "model": "/model/model_weights/mlc-ai/Llama-3-8B-Instruct-q4f16_1-MLC",
        "messages": [{"role": "user", "content": prompt}],
        "stream": True,
    }

    # Send the request to the inference endpoint
    with requests.post("http://192.168.0.10:8000/v1/chat/completions", json=payload, stream=True) as r:
        for chunk in r.iter_content(chunk_size=None):
            chunk = chunk.decode("utf-8")
            if "[DONE]" in chunk[6:]:
                break
            response = json.loads(chunk[6:])
            content = response["choices"][0]["delta"].get("content", "")
            print(content, end="", flush=True)
    print("\n")  # Newline for better readability between responses
During inference, you can watch the pods' activity in the web interface of the Kubernetes dashboard.
You can further enhance your project by leveraging tools like Prometheus and Grafana to monitor metrics and set up alerts. As businesses utilize LLMs to develop cutting-edge solutions, the need for scalable, secure, and efficient deployment platforms becomes increasingly crucial. Kubernetes has emerged as the preferred option due to its scalability, flexibility, portability, and resilience. By following the steps outlined above, individuals can efficiently manage their LLM services on edge devices, such as the Nvidia Jetson AGX Orin Developer Kit, ensuring they are robust and ready for production workloads.
I hope this was useful to you. If you have any questions, please do not hesitate to contact me here.