Most filesystems can only search files by their names, but not all filenames are meaningful. For example, photos taken by phones or cameras are usually named by ascending numbers or the date they were taken, so it is generally hard to find a particular photo from the filename alone. With the help of AI models, we can encode photos and text descriptions of photos into points in a shared latent space, where points that are close to each other tend to have similar meanings. To search for a photo, we encode its description into a point, find that point's nearest neighbors in the latent space, and retrieve the photos those neighbors were encoded from.
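As a minimal sketch of the idea, assuming unit-normalized embedding vectors have already been produced by some image/text encoder (the file names below are hypothetical), nearest-neighbor retrieval in the latent space reduces to a dot-product search:

```python
import numpy as np

# Hypothetical, pre-computed photo embeddings: one row per photo, unit-normalized.
photo_embeddings = np.load("photo_embeddings.npy")          # shape: (num_photos, dim)
photo_paths = open("photo_paths.txt").read().splitlines()   # one path per embedding row

def search(query_embedding: np.ndarray, top_k: int = 5):
    """Return the photos whose embeddings are closest to the query embedding."""
    # Cosine similarity reduces to a dot product when vectors are unit-normalized.
    scores = photo_embeddings @ query_embedding
    best = np.argsort(-scores)[:top_k]
    return [(photo_paths[i], float(scores[i])) for i in best]
```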
AI models usually require significant compute resources. Running them may consume too much CPU time or power on a personal computer. With the help of the NPU embedded in the AMD 7940HS, we can offload some of the computation required by AI models to the NPU and relieve the burden on the CPU.
The source code of this project is published on GitHub. More details about how to run the code and reproduce this project can be found in the README of the repository. It can be found in the attachment as well.
Step 1: Choose Model
There are many models capable of powering this project. We chose the ResNet50 variant of CLIP, proposed by OpenAI, as our base model, because it is efficient and shares a typical model architecture with many other models, so this project can easily be generalized to other models. Luckily, we found a project on GitHub with a similar idea, which we use as a base project and extend to run the models on the AMD NPU.
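For reference, the ResNet50 variant can be loaded with OpenAI's clip package in a few lines; this is only a sketch of the stock PyTorch model before any of the changes described below:

```python
import torch
import clip

# Load the ResNet50 variant of CLIP on CPU; `preprocess` is the matching image transform.
model, preprocess = clip.load("RN50", device="cpu")
model.eval()

# The two halves we care about: the image encoder (model.visual / encode_image)
# and the text encoder (encode_text).
with torch.no_grad():
    text_tokens = clip.tokenize(["a photo of a dog on the beach"])
    text_features = model.encode_text(text_tokens)
```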
Step 2: Figure out what is done
Our base project contains two components: one indexes the existing files into a database that can be queried efficiently when we search for a particular file, and the other works as a frontend where users enter a text description and query the database we created. For the database we use FAISS, a library for efficient similarity search and clustering of dense vectors, introduced by Facebook. CLIP encodes an image or a text into a vector that represents a point in a high-dimensional space; the closer two points are, the more likely the image or text inputs that produced them have similar meanings.
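A rough sketch of how the two components fit together with FAISS, assuming the CLIP embeddings are L2-normalized so that inner product equals cosine similarity (the .npy files here stand in for whatever the indexing component actually stores):

```python
import faiss
import numpy as np

dim = 1024  # embedding size of the RN50 CLIP image/text encoders

# Indexing component: add normalized image embeddings to a flat inner-product index.
index = faiss.IndexFlatIP(dim)
image_embeddings = np.load("image_embeddings.npy").astype("float32")  # hypothetical file
faiss.normalize_L2(image_embeddings)
index.add(image_embeddings)

# Query component: encode the text description the same way, then take nearest neighbors.
query = np.load("query_embedding.npy").astype("float32").reshape(1, -1)  # hypothetical file
faiss.normalize_L2(query)
scores, ids = index.search(query, 5)  # ids map back to the indexed image files
```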
The base project runs the models using PyTorch, so the models are executed either on the CPU or a GPU, depending on the hardware environment. We need to change the way the models are executed so that they utilize the NPU in our CPU.
Step 3: Quantize ResNet50 & export ONNX model
The CLIP model from OpenAI is originally a PyTorch model. The AMD Ryzen AI tutorial recommends deploying models in ONNX format. For inference, CLIP can be separated into two independent models: an image encoder and a text encoder.
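A sketch of how the two encoders can be exported separately with `torch.onnx.export`; the wrapper module and output file names are illustrative, not the repo's exact code:

```python
import torch
import clip

model, _ = clip.load("RN50", device="cpu")
model.eval()

class TextEncoder(torch.nn.Module):
    """Thin wrapper so the text branch can be traced and exported on its own."""
    def __init__(self, clip_model):
        super().__init__()
        self.clip_model = clip_model

    def forward(self, tokens):
        return self.clip_model.encode_text(tokens)

dummy_image = torch.randn(1, 3, 224, 224)
dummy_tokens = clip.tokenize(["a dummy caption"])

# Image encoder: model.visual is the ResNet50 trunk plus the attention pool.
torch.onnx.export(model.visual, dummy_image, "clip_rn50_image.onnx", opset_version=13)
# Text encoder: exported through the wrapper above.
torch.onnx.export(TextEncoder(model), dummy_tokens, "clip_rn50_text.onnx", opset_version=13)
```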
Models are required to be quantized before they can run efficiently on the NPU. AMD provides several ways to quantize a model. One is the Vitis AI Quantizer, provided by Ryzen AI, which easily quantizes models in ONNX format. Unfortunately, this quantization tool is not effective for models other than CNNs, which is one of the reasons we chose the ResNet50 variant of CLIP. Another way is to use the dynamic quantization provided by PyTorch.
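The ONNX path is the usual post-training static quantization flow. The sketch below uses the generic `onnxruntime.quantization` API; to our understanding the Vitis AI ONNX quantizer follows a very similar interface, and the model path and input name here are hypothetical:

```python
import numpy as np
from onnxruntime.quantization import (CalibrationDataReader, QuantFormat,
                                      QuantType, quantize_static)

class ImageCalibrationReader(CalibrationDataReader):
    """Feeds a small set of preprocessed images to the quantizer for calibration."""
    def __init__(self, batches):
        self.batches = iter(batches)

    def get_next(self):
        return next(self.batches, None)  # {"input_name": numpy_array} or None when done

# In practice the calibration set should be real preprocessed photos;
# "input" must match the ONNX model's actual input name (assumed here).
calib = ImageCalibrationReader([{"input": np.random.rand(1, 3, 224, 224).astype(np.float32)}])

quantize_static(
    model_input="clip_rn50_resnet.onnx",        # hypothetical ResNet50-only model
    model_output="clip_rn50_resnet_int8.onnx",
    calibration_data_reader=calib,
    quant_format=QuantFormat.QDQ,
    activation_type=QuantType.QUInt8,
    weight_type=QuantType.QInt8,
)
```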
The ResNet50 variant of CLIP contains a classic ResNet50 model in its image encoder, followed by an attention pool. We tried to quantize the image encoder directly with the quantizer, following the official tutorial, but the quantized model failed to run on the NPU with a bunch of errors. Therefore, we needed to further separate the model into two parts, a pure ResNet50 and an attention pool, and quantize only the ResNet50 part with the Vitis AI Quantizer.
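One way to perform this split is to detach the attention pool from the trunk before export. A hedged sketch, assuming the attribute layout of the OpenAI CLIP implementation (where the pool lives at `model.visual.attnpool`):

```python
import copy
import torch
import clip

model, _ = clip.load("RN50", device="cpu")
visual = copy.deepcopy(model.visual)

# Keep the attention pool aside to run separately (on the CPU in our setup).
attnpool = visual.attnpool
# Replace it with Identity so the remaining module is a plain ResNet50 feature extractor.
visual.attnpool = torch.nn.Identity()

dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(visual, dummy, "clip_rn50_trunk.onnx", opset_version=13)

# At inference time, the full image embedding is attnpool(trunk_features).
```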
Step 4: Run ResNet50 part of the models
The output of the Vitis AI Quantizer is a quantized model which can be directly executed by ONNX Runtime with the VitisAIExecutionProvider.
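Running the quantized model on the NPU then looks roughly like this; the `config_file` provider option points at the vaip_config.json shipped with the Ryzen AI software, and the exact option names depend on the Ryzen AI release:

```python
import numpy as np
import onnxruntime as ort

# Session backed by the Vitis AI execution provider; unsupported ops fall back to CPU.
session = ort.InferenceSession(
    "clip_rn50_trunk_int8.onnx",
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
    provider_options=[{"config_file": "vaip_config.json"}, {}],
)

image = np.random.rand(1, 3, 224, 224).astype(np.float32)  # stand-in for a preprocessed photo
input_name = session.get_inputs()[0].name
features = session.run(None, {input_name: image})[0]
```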
Step 5: Quantize and run the rest of the models
Now we need to quantize the rest of the image encoder and the text encoder. These remaining parts are Transformer-based architectures, which consist of many GEMM operations. From the example code on GitHub (even though that code does not run as-is), we found that we can directly launch the GEMM kernel from Python. Therefore, we can quantize the GEMM operations of the models with PyTorch dynamic quantization and offload the quantized operations to the NPU. Although the entire models do not run on the NPU, we at least execute the GEMM operations, which usually account for a large portion of the execution time, on the NPU. Since the attention pool in the image encoder is relatively small and simple, we leave it to run on the CPU. We only quantized the text encoder in this project.
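A sketch of the PyTorch side: dynamically quantize the Linear (GEMM) layers, then swap them for an NPU-backed GEMM wrapper. The `npu_linear_factory` argument below is a placeholder for whatever wraps AMD's GEMM kernel (for example the qlinear-style example in the RyzenAI sample code), not a real API:

```python
import torch
import clip

model, _ = clip.load("RN50", device="cpu")
model.eval()

# Replace every nn.Linear (the GEMMs of the Transformer text encoder) with a
# dynamically quantized version that uses int8 weights.
quantized_model = torch.ao.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def offload_linears_to_npu(module, npu_linear_factory):
    """Recursively swap dynamically quantized Linear modules for NPU-backed GEMM wrappers.

    `npu_linear_factory` is a placeholder for the code that constructs the NPU
    kernel wrapper; it receives the quantized module and returns the replacement.
    """
    for name, child in module.named_children():
        if isinstance(child, torch.ao.nn.quantized.dynamic.Linear):
            setattr(module, name, npu_linear_factory(child))
        else:
            offload_linears_to_npu(child, npu_linear_factory)
```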
The ONNX model we got from Step 3 does not directly contain GEMM operations. We need to simplify the model and fuse the MatMul+Add pairs into GEMM operations by surrounding each MatMul+Add with two Reshape operations. Finally, we convert the simplified model to PyTorch for dynamic quantization and export.
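The fusion step can be expressed with standard ONNX tooling. A hedged sketch using onnx-simplifier plus the onnxoptimizer `fuse_matmul_add_bias_into_gemm` pass (GEMM only accepts 2-D inputs, which is why the surrounding Reshape nodes are needed); our actual graph surgery may differ in detail:

```python
import onnx
import onnxoptimizer
from onnxsim import simplify

model = onnx.load("clip_rn50_text.onnx")

# Constant-fold and simplify the graph first so the MatMul+Add patterns are exposed.
model, ok = simplify(model)
assert ok, "onnx-simplifier could not validate the simplified model"

# Fuse MatMul followed by Add into a single Gemm node where the shapes allow it
# (in our flow, Reshape nodes are inserted around the pattern so the inputs are 2-D).
model = onnxoptimizer.optimize(model, ["fuse_matmul_add_bias_into_gemm"])
onnx.save(model, "clip_rn50_text_gemm.onnx")
```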
Now, we can execute the image encoding model in two steps, the ResNet50 with ONNX Runtime and the rest of it with PyTorch offload, and the text encoding model entirely with PyTorch offload. The last thing we need to do is to integrate our model execution into the original base project. We replace the original calls to the official OpenAI CLIP PyTorch model with execution of the quantized ONNX model and the modified PyTorch model.
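After integration, image encoding becomes a small two-stage pipeline. A rough sketch that reuses names from the earlier sketches (`session`/`input_name` from Step 4, `attnpool` from Step 3, `quantized_model` from Step 5); the function names are illustrative, not the repo's actual API:

```python
import torch

def encode_image(preprocessed_image: torch.Tensor) -> torch.Tensor:
    """Stage 1: quantized ResNet50 trunk on the NPU via ONNX Runtime.
    Stage 2: attention pool on the CPU with PyTorch."""
    feats = session.run(None, {input_name: preprocessed_image.numpy()})[0]
    return attnpool(torch.from_numpy(feats))

def encode_text(tokens: torch.Tensor) -> torch.Tensor:
    """Dynamically quantized text encoder whose GEMMs are offloaded to the NPU."""
    with torch.no_grad():
        return quantized_model.encode_text(tokens)
```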
Usage & Result
The usage of this project can be found in the README.md of the project repo.
Future Steps
This project is just a demo of the idea and can be improved in many aspects.
- The preprocessing of the images runs on the CPU; it could be merged into the ResNet50 model if the size of the input images is constant.
- The GEMM operations in the attention pool of the image encoder are not offloaded to the NPU.
- The Transformer-based models could be converted to ONNX as well, once AMD resolves this issue.