Vision language models (VLMs) are a powerful new class of AI models that can understand and process both visual and textual information. This capability makes them ideal for a wide range of applications, from image captioning and visual question answering to robotic control and autonomous vehicles.
One of the latest advancements in VLMs is the Microsoft Phi-3.5-Vision model. This lightweight, optimized model is designed to run on edge devices with limited memory and computational resources, making it well suited to real-world applications.
This tutorial will guide you through running Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin 64GB Developer Kit using ONNX Runtime GenAI and TensorRT-LLM.
Running Phi-3.5-Vision via ONNX Runtime GenAI
Special thanks to Kunal Vaishnavi from Microsoft for his assistance in getting Phi-3.5-Vision up and running on the NVIDIA Jetson AGX Orin 64GB Developer Kit.
ONNX Runtime is an open-source framework that allows you to run machine learning models in a variety of environments, including edge devices. To begin, we'll install ONNX Runtime and build ONNX Runtime GenAI:
Download and install the ONNX Runtime .whl file
pip install http://108.39.248.12:81/jp6/cu126/+f/0c4/18beb3326027d/onnxruntime_gpu-1.20.0-cp310-cp310-linux_aarch64.whl
Clone ONNX Runtime GenAI and prepare folders
git clone https://github.com/microsoft/onnxruntime-genai
cd onnxruntime-genai
mkdir -p ort/include/
mkdir -p ort/lib/
Find where the .whl file is installed (for example, with pip3 show onnxruntime-gpu):
Name: onnxruntime-gpu
Version: 1.20.0
Summary: ONNX Runtime is a runtime accelerator for Machine Learning models
Home-page: https://onnxruntime.ai
Author: Microsoft Corporation
Author-email: onnxruntime@microsoft.com
License: MIT License
Location: /home/jetson/.local/lib/python3.10/site-packages
Requires: coloredlogs, flatbuffers, numpy, packaging, protobuf, sympy
Required-by:
Copy shared libraries to ort/lib/
cp /home/jetson/.local/lib/python3.10/site-packages/onnxruntime/capi/libonnxruntime*.so* ort/lib/
Download the header files into the ort/include/ folder:
cd ort/include/
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_c_api.h
wget https://raw.githubusercontent.com/microsoft/onnxruntime/rel-1.20.0/include/onnxruntime/core/session/onnxruntime_float16.h
Create a symbolic link:
ln -s /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so.1.20.0 /home/jetson/Projects/onnxruntime-genai/ort/lib/libonnxruntime.so
Build ONNX Runtime GenAI from source using the following command, specifying the appropriate CUDA version and paths:
python3 build.py --use_cuda --cuda_home /usr/local/cuda-12.6 --config Release --ort_home ./ort --skip_tests --parallel
The build process should output the following:
[100%] Built target onnxruntime-genai
[100%] Linking CUDA shared module onnxruntime_genai.cpython-310-aarch64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 6 LTRANS jobs
Copying files to wheel directory: /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel/onnxruntime_genai
[100%] Built target python
[100%] Building wheel on /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Processing /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
Preparing metadata (setup.py) ... done
Building wheels for collected packages: onnxruntime-genai-cuda
Building wheel for onnxruntime-genai-cuda (setup.py) ... done
Created wheel for onnxruntime-genai-cuda: filename=onnxruntime_genai_cuda-0.6.0.dev0-cp310-cp310-linux_aarch64.whl size=3620020 sha256=2f8cc34c586536e0090a79bf0791476483b8e272289019abfd2407c40696e417
Stored in directory: /tmp/pip-ephem-wheel-cache-zk6pno6c/wheels/81/66/19/2f6df378bb018e430d402b3aa78c90c3d8f151cdace18340e1
Successfully built onnxruntime-genai-cuda
[100%] Built target PyPackageBuild
Navigate to the directory where the .whl file is located and install it using pip:
cd /home/jetson/Projects/onnxruntime-genai/build/Linux/Release/wheel
pip3 install *.whl
Use the following Python command to verify that the installation was successful:
python3 -c 'import onnxruntime_genai; print(onnxruntime_genai.Model.device_type)'
This should output:
<property object at 0xffff8d395120>
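As an optional extra check, you can also confirm that the onnxruntime-gpu wheel installed earlier exposes the CUDA execution provider. The snippet below is a minimal sketch; the exact provider list depends on your build.
import onnxruntime as ort

# On the Jetson AGX Orin the list should include "CUDAExecutionProvider".
providers = ort.get_available_providers()
print(providers)
assert "CUDAExecutionProvider" in providers, "CUDA execution provider not found"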
Download the ONNX version of the Phi-3.5-Vision model:
huggingface-cli download microsoft/Phi-3.5-vision-instruct-onnx --local-dir ./Phi-3.5-vision-instruct-onnx
Download the example script
wget https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py
Run inference using the command below:
python3 phi3v.py -m ./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32/ -p cuda
This command will execute the Phi-3.5-Vision model and generate a caption for a given image.
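If you want to embed the model in your own application instead of using phi3v.py, the script is built on the onnxruntime_genai Python API. The sketch below mirrors its general structure; the model directory and image path are placeholders, and method names can differ between onnxruntime-genai releases, so treat the bundled phi3v.py as the authoritative reference.
import onnxruntime_genai as og

# Load the quantized ONNX model downloaded earlier (path is an example).
model = og.Model("./Phi-3.5-vision-instruct-onnx/gpu/gpu-int4-rtn-block-32")
processor = model.create_multimodal_processor()
tokenizer_stream = processor.create_stream()

# Phi-3.5-Vision chat template with a single image placeholder.
prompt = "<|user|>\n<|image_1|>\nDescribe the image<|end|>\n<|assistant|>\n"
images = og.Images.open("test.jpg")  # placeholder image path
inputs = processor(prompt, images=images)

params = og.GeneratorParams(model)
params.set_inputs(inputs)
params.set_search_options(max_length=3072)

# Stream the generated tokens to stdout.
generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(tokenizer_stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
print()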
Demonstrating the capabilities of Phi-3.5-Vision using ONNX Runtime GenAI.
TensorRT-LLM is another option for running VLMs on edge devices, providing a high-performance inference engine specifically designed for large language models on NVIDIA GPUs.
Running Phi-3.5-Vision via TensorRT-LLM
You can easily get started with TensorRT-LLM by following the documentation on the NVIDIA Jetson AI Lab page: https://www.jetson-ai-lab.com/tensorrt_llm.html
To run Microsoft Phi-3.5-Vision on the NVIDIA Jetson AGX Orin using TensorRT-LLM, follow these steps:
First, download the Phi-3.5-Vision model using the Hugging Face CLI. Run the following command to download the model to a local directory:
huggingface-cli download microsoft/Phi-3.5-vision-instruct --local-dir ./Phi-3.5-vision-instruct
Edit the config.json file in the downloaded model directory and change the attn_implementation parameter from flash_attention_2 to eager.
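If you prefer to script this edit, a minimal sketch along these lines works; the path assumes the download directory used above, and the snippet updates the attention key under whichever spelling your config.json uses.
import json

# Switch to the eager attention implementation, since FlashAttention 2 is not used here.
config_path = "./Phi-3.5-vision-instruct/config.json"
with open(config_path) as f:
    config = json.load(f)

# Some configs spell the key with a leading underscore, so handle both.
for key in ("attn_implementation", "_attn_implementation"):
    if key in config:
        config[key] = "eager"

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)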
Then, convert the checkpoint using the convert_checkpoint.py script:
python3 ./TensorRT-LLM/examples/phi/convert_checkpoint.py \
--model_dir ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-convert \
--dtype float16
Next, build the TensorRT engine using the trtllm-build command:
trtllm-build \
--checkpoint_dir ./Phi-3.5-vision-instruct-convert \
--output_dir ./Phi-3.5-vision-instruct-engine \
--gpt_attention_plugin float16 \
--gemm_plugin float16 \
--max_batch_size 1 \
--max_input_len 4096 \
--max_seq_len 4608 \
--max_multimodal_len 4096
This command builds the TensorRT engine for the language model component; the vision encoder is built separately in the next step.
Then, build the visual engine using the build_visual_engine.py script:
python3 ./TensorRT-LLM/examples/multimodal/build_visual_engine.py \
--model_type phi-3-vision \
--model_path ./Phi-3.5-vision-instruct \
--output_dir ./Phi-3.5-vision-instruct-vision_encoder
Finally, run the model using the run.py script:
python3 ./TensorRT-LLM/examples/multimodal/run.py \
--hf_model_dir ./Phi-3.5-vision-instruct \
--visual_engine_dir ./Phi-3.5-vision-instruct-vision_encoder \
--llm_engine_dir ./Phi-3.5-vision-instruct-engine \
--image_path=./test3.jpg \
--input_text "Describe the image"
This command executes the Phi-3.5-Vision model on the specified input image and generates a caption, demonstrating the model's capability to understand and describe visual content.
Example #1: Given an image of road signage, the model generates a caption describing the scene and demonstrates its OCR (Optical Character Recognition) capability.
Output:
Example #2: With an image of two retriever puppies, the model produces a caption describing the image content.
Output:
Both methods, ONNX Runtime GenAI and TensorRT-LLM, demonstrated decent inference speeds, highlighting the efficiency of running Phi-3.5-Vision on the NVIDIA Jetson AGX Orin.
This enables you to develop innovative edge computing applications that combine visual and language understanding. Whether you're working on image captioning, visual question answering, or other creative projects, Phi-3.5-Vision provides a versatile foundation for unlocking the potential of vision language models at the edge.
I hope you found this guide useful, and thanks for reading. If you have any questions or feedback, leave a comment below. If you like this post, please support me by subscribing to my blog.