AutoEncoders are a class of neural networks designed for the task of unsupervised learning. They can be used for image compression, denoising, anomaly detection, generative models, and more. Their core principle is learning a compact representation for input data and reconstructing the input from this compressed representation. This representation of the input data is referred to as the latent embeddings. The process of obtaining the latent embeddings is called encoding, and the process of reconstructing the input data based on these latent embeddings is called decoding. Through the intermediate compression process, we can obtain an efficient representation of the input data, which can reduce the dimensionality of the original data.
With the development of image generation models, an innovative approach in AutoEncoder architecture, the Vector Quantized Variational Autoencoder (VQ-VAE), has come to the forefront. To simplify the introduction of VQ-VAE, we will ignore the variational component and start directly from the AutoEncoder structure. Unlike a standard AutoEncoder, the VQ-VAE introduces a vector quantization module between the encoder and the decoder. In VQ-VAE, the continuous latent representations are quantized to the nearest entry in a codebook of fixed-size vectors. This quantization operation introduces discreteness, enabling more efficient learning and manipulation of the latent space. The codebook vectors, learned during training, effectively serve as a vocabulary for the model, enabling it to capture and reproduce complex visual structures with greater fidelity. In other words, the latent representations are categorized into discrete entries in the codebook, and images are generated based on these discrete representations.
However, one primary challenge of VQ-VAE lies in gradient estimation during backpropagation, due to the discrete nature of the quantization step. The quantization operation is inherently non-differentiable, which poses a problem for backpropagation. While VQ-VAE employs workarounds such as the straight-through estimator to address this issue, they result in a complex and potentially unstable training process.
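As a minimal sketch of the issue (not the exact VQ-VAE implementation; shapes and names below are illustrative), the nearest-codebook lookup has no useful gradient, and the common workaround is the straight-through estimator, which copies the decoder's gradient past the quantization step:
import torch

# z_e: continuous latent vectors; codebook: learnable code vectors (illustrative shapes)
z_e = torch.randn(16, 64, requires_grad=True)         # (batch, dim)
codebook = torch.randn(512, 64, requires_grad=True)   # (num_codes, dim)

# Quantize each latent to its nearest codebook entry.
# argmin is non-differentiable, so no gradient reaches z_e through this path.
indices = torch.cdist(z_e, codebook).argmin(dim=-1)
z_q = codebook[indices]

# Straight-through estimator: the forward pass uses z_q,
# while the backward pass copies the decoder's gradient from z_q to z_e unchanged.
z_q_st = z_e + (z_q - z_e).detach()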
Considering that the ultimate goal of VQ-VAE is to obtain the indexes in the codebook, the encoding and quantization operations may not be strictly necessary. Building upon the characteristics of natural images, we propose the Color-Shape-Texture AutoEncoder (CST-AE) to overcome the limitations of VQ-VAE.
Typically, an image can be decomposed into features such as color, shape, and texture, ranging from low to high frequencies. In CST-AE, we first isolate the color of the image and assign the corresponding categories of colors to index the learnable vectors in the codebook. Then, we sequentially separate the shape and texture features and obtain the vectors in the codebook. These vectors representing color, shape, and texture are combined to form the discrete latent representation of the image. The decoder can then use these vectors to reconstruct the original image. By separating color, shape, and texture, we eliminate the need for encoder networks in traditional AutoEncoders, thereby avoiding the gradient backpropagation issues in the quantization module during training.
2. Implementation of CST-AutoEncoder
2.1. Encoding -- Separating the color, shape, and texture features
Following the approaches of Vision Transformer and ConvNeXt, we first divide the image into non-overlapping patches, for example 8x8 pixels per patch. Each patch then serves as the basic unit for separating colors, shapes, and textures. From a frequency-domain perspective, the color, shape, and texture within a patch increase progressively in frequency. Therefore, we separate the color layer first.
As shown in Figure 4, we divide the images into non-overlapping patches and use a K-Means Classifier to generate a category map, i.e., each patch is assigned a color category. By indexing the cluster centers using the category map, we can assign a color to each patch and derive the color feature of the image. Next, we subtract the color feature from the original image to obtain the first-level residual, which serves as the input for extracting the shape and texture features. Alternatively, if we index the learnable variables in the codebook, we obtain the discrete latent embeddings of the image in the color layer.
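A minimal sketch of this color-layer step is shown below, assuming a trained set of cluster centers (a random placeholder tensor stands in for them here) and the 8x8 patch size from above:
import torch
from einops import rearrange

patch = 8
img = torch.rand(1, 3, 256, 256)                       # input image (illustrative values)

# Split into non-overlapping 8x8 patches and average each channel.
patches = rearrange(img, 'b c (h hp) (w wp) -> (b h w) c (hp wp)', hp=patch, wp=patch)
patch_colors = patches.mean(-1)                        # (num_patches, 3)

# Assign each patch to a color category using the K-Means cluster centers.
cluster_centers = torch.rand(512, 3)                   # stands in for the trained centers
category_map = torch.cdist(patch_colors, cluster_centers).argmin(dim=-1)

# Color feature: paint every pixel of a patch with its cluster-center color.
color_feature = cluster_centers[category_map]                          # (num_patches, 3)
color_feature = color_feature[:, :, None].repeat(1, 1, patch * patch)  # (num_patches, 3, 64)
color_feature = rearrange(color_feature, '(b h w) c (hp wp) -> b c (h hp) (w wp)',
                          b=1, h=256 // patch, w=256 // patch, hp=patch, wp=patch)

# First-level residual: input to the shape and texture separation.
residual_1 = img - color_feature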
Now, the question is how to train a K-means classifier. Initially, we select the ImageNet dataset as the training dataset, which contains over 1.28 million images. Then, we divide each image into patches and compute the average pixel value across the three color channels as the color of that patch. Finally, we use the MiniBatchKMeans algorithm from the sklearn library to cluster the colors from all patches into 512 distinct categories. The process can be represented in Python as follows:
# python
import torch
from sklearn.cluster import MiniBatchKMeans
from einops import rearrange
from timm.data import create_dataset, create_loader

# example settings
batch_size, img_size, num_workers = 256, 256, 8
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

cluster_model = MiniBatchKMeans(n_clusters=512)
set_path = '/path/to/your/ImageNet/dataset/'
dataset = create_dataset("ImageFolder", set_path, is_training=False, batch_size=batch_size, split="train")
loader = create_loader(dataset, (3, img_size, img_size), batch_size, False, use_prefetcher=True, no_aug=True, mean=(0.5, 0.5, 0.5), std=(0.5, 0.5, 0.5), num_workers=num_workers, device=device)
for batch, label in loader:
    # cut the image into non-overlapping patches
    batch = rearrange(batch, 'b c (h hp) (w wp)->(b h w) c (hp wp)', hp=8, wp=8)
    # calculate the average color of the three channels per patch
    batch = batch.mean(-1)
    # incrementally train the MiniBatchKMeans model on this batch of patch colors
    cluster_model = cluster_model.partial_fit(batch.cpu().numpy())
We visualize the cluster centers after training in Fig. 5. The visualization shows that the centers cover a wide range of colors, from white to black, with reds, blues, and greens in between. We can then use this K-Means classifier to assign a color category to each patch.
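A palette similar to Fig. 5 can be rendered from the trained centers, for example with the rough sketch below (it assumes the centers live in the normalized space produced by the mean/std settings in the loader above):
import joblib
import matplotlib.pyplot as plt

kmeans_model = joblib.load('/path/to/kmeans/model')
# arrange the 512 RGB centers into a 16x32 grid, undo the (0.5, 0.5) normalization, and plot
palette = kmeans_model.cluster_centers_.reshape(16, 32, 3) * 0.5 + 0.5
plt.imshow(palette.clip(0, 1))
plt.axis('off')
plt.savefig('color_palette.png')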
Clustering the first-level residuals in the same way yields a K-means classifier for the shape feature. The clustering process is similar to that of the color, with the key difference being that there is no need to average the pixels within each patch. The shape feature is then subtracted from the first-level residual to obtain the second-level residual, which is used to separate the texture feature. The clustering numbers for the shape and texture features are 2048 and 8192, respectively.
With these three classifiers, we can extract the color, shape, and texture features of the original image. In addition to the CST features, there is also a third-level residual. Theoretically, this separation process could continue indefinitely; however, for reasons of effectiveness and speed, we discard the third-level residual and retain only the three CST features. From an image compression perspective, the number of values per patch is reduced from 3x8x8 to 3. In other words, we compress the image to 1/64th of its original size.
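Putting the three classifiers together, the encoding stage can be sketched as follows (a rough sketch: function names, tensor shapes, and the cluster-center tensors are illustrative stand-ins, not the exact implementation):
import torch
from einops import rearrange

def patchify(x, p=8):
    # (B, C, H, W) -> (num_patches, C * p * p)
    return rearrange(x, 'b c (h hp) (w wp) -> (b h w) (c hp wp)', hp=p, wp=p)

def quantize(x, centers):
    # nearest-center lookup: category indices and the quantized patches
    idx = torch.cdist(x, centers).argmin(dim=-1)
    return idx, centers[idx]

def cst_encode(img, color_centers, shape_centers, texture_centers, p=8):
    # color_centers: (512, 3), shape_centers: (2048, 3*p*p), texture_centers: (8192, 3*p*p)
    b, c, h, w = img.shape
    patches = patchify(img, p)                                            # (N, c*p*p)

    # Color layer: average color per patch, quantized into 512 categories.
    color_idx, color_q = quantize(patches.reshape(-1, c, p * p).mean(-1), color_centers)
    color_feat = color_q[:, :, None].repeat(1, 1, p * p).reshape(-1, c * p * p)
    res1 = patches - color_feat                                           # first-level residual

    # Shape layer: quantize the first-level residual into 2048 categories.
    shape_idx, shape_q = quantize(res1, shape_centers)
    res2 = res1 - shape_q                                                 # second-level residual

    # Texture layer: quantize the second-level residual into 8192 categories.
    texture_idx, texture_q = quantize(res2, texture_centers)
    # The third-level residual (res2 - texture_q) is discarded.

    return color_idx, shape_idx, texture_idx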
2.2. Decoding -- Reconstructing the Image
The separation process yields category indices for the color, shape, and texture features. We can then index the learned embeddings in the codebooks to obtain the discrete latent embeddings for each patch. We need a network that takes these discrete latent embeddings as input and produces the image as output. Since the image content is fully determined by the color, shape, and texture categories, the decoding process simply renders the input embeddings. We refer to this decoding module as RenderNet.
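In code, building the discrete latent embeddings that feed RenderNet might look like the following sketch; the embedding dimension and the way the three embeddings are combined (summation) are assumptions for illustration:
import torch
import torch.nn as nn

dim = 256  # assumed latent embedding dimension

# One learnable codebook per feature level, sized by the clustering above.
color_codebook = nn.Embedding(512, dim)
shape_codebook = nn.Embedding(2048, dim)
texture_codebook = nn.Embedding(8192, dim)

# color_idx, shape_idx, texture_idx come from the encoding stage (one index per patch).
color_idx = torch.randint(0, 512, (1024,))
shape_idx = torch.randint(0, 2048, (1024,))
texture_idx = torch.randint(0, 8192, (1024,))

# Combine the three discrete embeddings (summation is one simple choice).
patch_embed = color_codebook(color_idx) + shape_codebook(shape_idx) + texture_codebook(texture_idx)
# patch_embed is then arranged into a (B, dim, H/8, W/8) grid and rendered back to pixels by RenderNet.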
For the subsequent ease of deployment, we need to consider hardware-supported operations when designing the network. The operators supported by the NPU in the ONNX framework can be found here (Model Compatibility — Ryzen AI Software 1.1 documentation (amd.com)).
The architecture of RenderNet is shown in Figure 8. The feature extraction stage consists of 12 blocks, and each block comprises a 3x3 convolutional layer, a batch normalization layer, a ReLU activation layer, and a 1x1 convolutional layer. The input and output channels of the 3x3 convolutional layer are 256 and 1024, respectively, while the input and output channels of the 1x1 convolutional layer are 1024 and 256. The block's Python code using the PyTorch framework is as follows:
# python
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, dim=256, expand_ratio=4):
        super().__init__()
        self.module = nn.Sequential(
            nn.Conv2d(dim, dim * expand_ratio, 3, 1, 1, bias=False),
            nn.BatchNorm2d(dim * expand_ratio),
            nn.ReLU(),
            nn.Conv2d(dim * expand_ratio, dim, 1, 1, 0),
        )

    def forward(self, x):
        # residual connection around the 3x3 conv -> BN -> ReLU -> 1x1 conv module
        x = self.module(x) + x
        return x
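Based on this block, the 12-block feature extraction stage described above could be assembled roughly as follows; the output head that maps the features back to 3x8x8 pixels per patch is an assumption, shown here with a 1x1 convolution and PixelShuffle for illustration:
class RenderNet(nn.Module):
    def __init__(self, dim=256, depth=12, patch=8):
        super().__init__()
        # Feature extraction stage: 12 residual blocks as described above.
        self.feature = nn.Sequential(*[Block(dim) for _ in range(depth)])
        # Illustrative head: project each patch feature to 3 x 8 x 8 pixels
        # and rearrange them into the image grid (the actual head may differ).
        self.head = nn.Sequential(
            nn.Conv2d(dim, 3 * patch * patch, 1, 1, 0),
            nn.PixelShuffle(patch),
        )

    def forward(self, x):
        # x: (B, dim, H/8, W/8) grid of patch embeddings
        return self.head(self.feature(x))
Keeping the block stack in a single nn.Sequential attribute (self.feature) also lines up with the quantization code later, which operates on model.feature.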
After training this network on ImageNet, we obtain a checkpoint of the trained parameters. The checkpoint is available on Google Drive.
3. Deployment of CST-AutoEncoder
3.1. Hardware and software preparation
- Enable the NPU device in the BIOS and install the NPU driver (Installation Instructions — Ryzen AI Software 1.1 documentation (amd.com))
- Install the Ryzen AI Software (Installation Instructions — Ryzen AI Software 1.1 documentation (amd.com))
- Install WSL2 (Install WSL | Microsoft Learn) and Docker Desktop (Install Docker Desktop on Windows | Docker Docs).
- Download the checkpoint, code, and datasets from https://drive.google.com/file/d/1pknX4-zAIZlRBdwJbnaJZN2VbH7siUiK/view?usp=sharing
3.2. Deploying encoding stages
The encoding stages consist of three main operations: K-Means classification, indexing, and matrix subtraction. The sklearn module uses only one CPU core to predict the categories of patches, which is inefficient. We can use batched matrix operations in PyTorch to accelerate the K-Means classification. The classification is based on the Euclidean distance: we compute the distance between each patch and the cluster centers and identify the center with the smallest distance. The index of the closest cluster center determines the category of the patch. This process can be represented with the following Python code:
import torch
import joblib

kmeans_model = joblib.load('/path/to/kmeans/model')
# data: a float tensor of average patch colors, shape (num_patches, 3)
# Predict with the sklearn framework (single CPU core)
cls = kmeans_model.predict(data)
# Predict with a batched distance computation in PyTorch
cluster_centers = torch.from_numpy(kmeans_model.cluster_centers_).float()
distance = torch.cdist(data, cluster_centers)
cls = distance.argmin(dim=-1)
The distance computation in PyTorch is parallel and highly optimized, which makes the classification step much faster than sklearn's single-core prediction.
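To check the speedup on your own machine, a rough timing comparison might look like this (the batch size and model path are placeholders):
import time
import joblib
import numpy as np
import torch

kmeans_model = joblib.load('/path/to/kmeans/model')
cluster_centers = torch.from_numpy(kmeans_model.cluster_centers_).float()
data = torch.rand(4096, cluster_centers.shape[1])     # e.g. 4096 average patch colors

# time the sklearn predict path
start = time.perf_counter()
cls_sklearn = kmeans_model.predict(data.numpy().astype(np.float64))
t_sklearn = time.perf_counter() - start

# time the batched PyTorch distance path
start = time.perf_counter()
cls_torch = torch.cdist(data, cluster_centers).argmin(dim=-1)
t_torch = time.perf_counter() - start

print(f'sklearn: {t_sklearn * 1e3:.2f} ms, torch: {t_torch * 1e3:.2f} ms')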
3.3. Deploying RenderNet
The NPU only supports 8-bit operations, so we need to quantize the model first. The PyTorch-Quantization framework from AMD Ryzen-AI is based on the Linux system, while the AI PC runs Windows 11. The best practice is to enable WSL2 on the PC and run Docker within WSL2 for quantization. Next, we describe this process in detail.
3.3.1 Quantization
First, start Docker Desktop.
Then open a WSL terminal in the quantization folder, pull the Docker image, and run the container:
docker pull xilinx/vitis-ai-pytorch-cpu:latest
./docker_run.sh xilinx/vitis-ai-pytorch-cpu:latest
Then activate the conda environment inside the container and install the necessary packages:
conda activate vitis-ai-pytorch
pip install timm einops tqdm
The core code for quantization is as follows:
import torch.nn as nn
from pytorch_nndct.apis import torch_quantizer

# create a quantizer in calibration mode for the feature-extraction module
quantizer = torch_quantizer('calib', model.feature, (input,), output_dir=args.output_path)
quant_model = quantizer.quant_model
loss_fn = nn.L1Loss()
# use 100 images to finetune the quantized model
quantizer.fast_finetune(evaluate, (quant_model, model.tokenizer, loader, loss_fn))
Or you can run the quantization script:
python rendernet_quant.py --checkpoint rendernet_tiny_amd_patch8_ft.pth --dataset-path minitrain3 --output-path quantization_results_256 --img-size 256
Then you can have a cup of coffee and wait until the quantization process is complete. After that, close WSL and Docker.
3.3.2 Deploying the RenderNet by ONNX sessions
The core code to deploy the model with an ONNX Runtime session is shown below.
import numpy as np
import onnxruntime

sess_options = onnxruntime.SessionOptions()
decoder = onnxruntime.InferenceSession(
    checkpoint_path,
    sess_options=sess_options,
    providers=["VitisAIExecutionProvider"],
    provider_options=[{"config_file": r"vaip_config.json"}],
)
# feed the patch embeddings to the quantized RenderNet and run inference
ort_inputs = {decoder.get_inputs()[0].name: patch_embed.numpy().astype(np.float32)}
out = decoder.run(None, ort_inputs)
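For a quick latency check of the session itself, similar to the benchmark mode shown below, a simple timing loop with random inputs can be used; the input shape here is an assumption (a 256x256 image with 8x8 patches gives a 32x32 grid of 256-dimensional embeddings) and should match your exported model:
import time
import numpy as np

# random input with the shape the exported model is assumed to expect
dummy = np.random.rand(1, 256, 32, 32).astype(np.float32)
ort_inputs = {decoder.get_inputs()[0].name: dummy}

# warm up, then average over repeated runs
for _ in range(5):
    decoder.run(None, ort_inputs)
runs = 50
start = time.perf_counter()
for _ in range(runs):
    decoder.run(None, ort_inputs)
print(f'average inference time: {(time.perf_counter() - start) / runs * 1e3:.2f} ms')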
You can also use the script to get the patch embeddings, reconstructed images, and inference times with the different modes:
Get the reconstructed image from an original image:
python .\rendernet_onnx.py --input-file .\ILSVRC2012_val_00045087.JPEG --output-file reconstructed_image.png --mode encode_render --checkpoint quantization\quantization_results_512\Sequential_int.onnx --img 512
Get the patch embeddings from an original image:
python .\rendernet_onnx.py --input-file .\ILSVRC2012_val_00045087.JPEG --output-file test_img_patch_embed.pth --mode encode --checkpoint quantization\quantization_results_512\Sequential_int.onnx --img 512
Get the inference time with random inputs:
Test inference time on NPU device
python .\rendernet_onnx.py --mode benchmark --checkpoint quantization\quantization_results_256\Sequential_int.onnx --target NPU --img-size 256
Test inference time on CPU device
python .\rendernet_onnx.py --mode benchmark --checkpoint quantization\quantization_results_256\Sequential_int.onnx --target CPU --img-size 256
4. Evaluation
The quality of image generation tends to be subjective, making it challenging to evaluate the model quantitatively. Additionally, AutoEncoder models are often used in conjunction with other models, such as large language models and stable diffusion models. We show some reconstructed images below. The low-frequency information is reconstructed well, while high-frequency details, such as the text on the ship, are difficult to reconstruct.
This project focuses on the deployment of the CST-AutoEncoder, so we tested the inference times of the models on both NPU and CPU for different resolutions.
Our tests showed that the NPU can accelerate inference by 11.1% at 256 resolution compared to the CPU, and by 19.9% at 512 resolution. Using the NPU significantly speeds up model inference, particularly for tasks requiring substantial computational resources.
5. Summary
We have designed a CST-AutoEncoder for AMD AI PCs that easily obtains a discrete feature representation of an image in terms of color, shape, and texture, and is capable of reconstructing the image based on these features. Subsequently, we quantized the model, converted it to ONNX format, and deployed it using the ONNX Runtime.
In the future, we plan to combine the encoder with a large language model (LLM) to equip the LLM with the ability to perceive images. Simultaneously, we can leverage the autoregressive capabilities of the LLM to generate the color, shape, and texture categories of images. Ultimately, we aim to unify image understanding and image generation within the LLM.