As I have a background in mobile mapping, I decided to build an image restoration pipeline that improves images from mobile mapping systems such as cars and drones (Source). Mobile mapping systems are typically used to capture 3D data of roads, cities and infrastructure. Most mobile systems use cameras and laser scanners to capture the 3D data. Since a mapping system captures only one image per scene, image quality is highly important. So the idea was born to build a processing pipeline that improves image quality. Because mobile mapping requirements are very specialized, I decided to build a more general solution that can be adapted to mobile mapping problems.
Another requirement for mobile mapping applications is energy efficiency. Most systems have limited energy resources, especially drones. So I decided to keep an eye on the energy consumption of the system. This is important not only for mobile mapping applications, as the electrical power consumption of cloud systems is predicted to increase nearly exponentially until 2030 (Source).
As the VCK5000 card is normally used in data center applications, not on mobile mapping cars, I decided to name my project "Green computing: Versal based image restoration pipeline".
Introduction
This project introduces an image restoration processing pipeline based on the UNet convolutional network. The pipeline is designed for a Versal VCK5000 card and trained with the medium-sized SIDD dataset. The whole processing pipeline is optimized to run efficiently in terms of performance, measured in frames per second (fps), and accuracy compared to a GPU-based inference. In addition to the pipeline development, a detailed study of the power consumption of a Versal system versus a GPU system was carried out. The project covers three different requirements:
- Energy consumption of the image restoration pipeline
- Process a specific number of frames per second at a specific model accuracy
- Reliable inference time and scalability
The image restoration pipeline is trained to remove noise from an image, as in the example below. The pipeline is optimized for smartphone camera images. A possible application could be a cloud-based image enhancement service.
The VCK5000 image restoration pipeline developed in this project outperforms a state-of-the-art GPU in terms of performance (fps) and power consumption. Besides the performance analysis, a detailed analysis shows how different training and quantization steps affect the accuracy of a convolutional network. The analysis is not tied to a specific model or network; the steps can easily be adapted to a custom application. The final network processed by the Versal VCK5000 is ranked in the TOP 15 "Image Denoising on SIDD" networks on paperswithcode (03/30/2022).
After the network optimization, a detailed power analysis of the Versal system is done. The power consumption of the Versal system is compared to a GPU-based inference.
Restriction: Besides the inference task, training and pre-processing of a neural network are highly energy demanding. The focus of this work is runtime power consumption and inference performance. The energy analysis of training and pre-processing depends mostly on the size of the training dataset (opinion of the author) and is not part of this project.
Project Overview
The code structure is inspired by the Xilinx/AMD Vitis-AI tutorials. All the needed steps are separated into different Python or shell scripts. The script run_all.sh processes all steps to build the whole processing pipeline.
UNet
The UNet network used for the pipeline was initially developed at the University of Freiburg. The network was originally designed for biomedical image segmentation tasks. Besides segmentation, the UNet structure can be used for image restoration. This paper presents an image restoration network based on UNet; the presented network outperforms current networks in image restoration tasks.
Source: University of Freiburg
The network is fully convolutional and has a u-shaped form. The left side of the "U" is the contracting path and the right side is the expansive path. One important characteristic of UNet is the large number of feature channels in the upsampling part, which allows the network to propagate context information to higher resolution layers.
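To make this structure more concrete, here is a minimal UNet-style sketch in PyTorch. It only illustrates the contracting path, the expansive path and the skip connections described above; the layer counts and channel sizes are assumptions and do not match the project's actual model.

```python
# Minimal UNet-style sketch (illustrative only, not the project's exact model).
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """Two 3x3 convolutions with ReLU, the basic UNet building block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.block(x)

class TinyUNet(nn.Module):
    def __init__(self, in_ch=3, out_ch=3):
        super().__init__()
        self.down1 = DoubleConv(in_ch, 64)
        self.down2 = DoubleConv(64, 128)
        self.pool = nn.MaxPool2d(2)
        self.bottleneck = DoubleConv(128, 256)
        self.up2 = nn.ConvTranspose2d(256, 128, 2, stride=2)
        self.conv2 = DoubleConv(256, 128)   # 128 skip channels + 128 upsampled
        self.up1 = nn.ConvTranspose2d(128, 64, 2, stride=2)
        self.conv1 = DoubleConv(128, 64)    # 64 skip channels + 64 upsampled
        self.head = nn.Conv2d(64, out_ch, 1)

    def forward(self, x):
        d1 = self.down1(x)                                      # contracting path
        d2 = self.down2(self.pool(d1))
        b = self.bottleneck(self.pool(d2))
        u2 = self.conv2(torch.cat([self.up2(b), d2], dim=1))    # expansive path with skip
        u1 = self.conv1(torch.cat([self.up1(u2), d1], dim=1))   # connection from the left side
        return self.head(u1)
```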
SIDD-Dataset
SIDD is short for "Smartphone Image Denoising Dataset". The dataset contains raw (noisy) and processed (ground-truth) images obtained with smartphone cameras and is available in three different sizes (small, medium, full). This project uses the medium size, which is about 20 GB and contains 96,000 images for training and 1,280 for validation.
Comparing images by eye is difficult, especially when the differences are minimal. In this work, the PSNR and SSIM metrics are used to compare images.
The peak signal-to-noise ratio (PSNR) is used as a quality measurement between an original and a compressed image, and its unit is the decibel (dB). Since PSNR is used to compare the UNet output with the ground-truth image, its value is a good metric to compare different quantization methods (Source). During training, we compare the PSNR value of the UNet output with the corresponding ground-truth image.
The Structural Similarity Index (SSIM) metric is used to measure the similarity between two given images. The comparison between the two images is performed on three basic features: luminance, contrast and structure. These three features are compared separately and equally weighted to obtain an SSIM value for each pixel of the compared images (Source). The SSIM output range is from 0 to 1. An SSIM of 1 means that the two images are identical; conversely, an SSIM of 0 means the two images are completely different. During training, we compare the SSIM value of the UNet output with the corresponding ground-truth image.
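As a small illustration, both metrics can be computed against the ground truth with scikit-image as sketched below; the project's actual evaluation code may differ.

```python
# Minimal example of the PSNR/SSIM comparison against the ground-truth image.
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def compare(restored: np.ndarray, ground_truth: np.ndarray):
    """restored / ground_truth: HxWx3 uint8 images."""
    psnr = peak_signal_noise_ratio(ground_truth, restored, data_range=255)
    # channel_axis works with current scikit-image; older releases use multichannel=True
    ssim = structural_similarity(ground_truth, restored, channel_axis=-1, data_range=255)
    return psnr, ssim
```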
PC-System
The PC system must be able to run an RTX3090 and a VCK5000 card. The VCK5000 needs Ubuntu 18.04 (Kernel 5.8) to get up and running; this kernel version is mandatory to get the card working. See Hackster Post. To ensure comparable power consumption, the measurements of both cards must be done with the same setup. The configuration of the PC system in detail is:
- AMD Ryzen ThreadRipper PRO 3955WX
- ASUS WRX80 Pro WS Sage SE Wifi (BIOS: PCIe lanes set to 3.0)
- 64 GB DDR4 RAM
- Asus RTX3090 TUF
- Ubuntu 18.04 (Kernel 5.8) Patches for AMD Ryzen ThreadRipper PRO
- Vitis-AI 1.4.1
- 1200 Watt Platinum PSU
To measure the power consumption of the system during inference, a Hama power meter is used. This power meter performs continuous power measurement; it can log the energy consumption over a specific time and sum up the total energy used. The typical measurement period for all setups was 1.5 hours. To avoid initial power peaks in the measurement, the inference task runs for 2.5 hours and the power measurement is started after 30 minutes.
AI inference on FPGAs
This chapter gives a short introduction to AI inference; this article goes much deeper and gives more details about FPGAs and their use cases. As neural networks mostly use floating-point numbers, FPGAs are not able to run neural network inference right out of the box. Floating-point processing engines are clocked slower and are less plentiful on FPGA devices. This is one reason why neural networks have to be quantized for FPGA inference.
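To illustrate what quantization means in practice, the toy sketch below maps a floating-point tensor to signed 8-bit integers and back. This is generic symmetric quantization for illustration only, not the exact scheme used by the Vitis-AI tools.

```python
# Toy example: map a float tensor to int8 and back to see the rounding error.
import numpy as np

def quantize_int8(x):
    scale = np.abs(x).max() / 127.0                        # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale                    # approximate float reconstruction

weights = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(weights)
print("max rounding error:", np.abs(weights - dequantize(q, scale)).max())
```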
UNet Training
The UNet network is trained with the SIDD Medium dataset for 250 epochs. The best PSNR value is 39.5937 dB, reached at epoch 228; the SSIM for the best model weights is 0.968954. If the network is processed in floating-point mode on a GPU, this is the best output result we can get. Training the UNet network is the first step in run_all.sh.
UNet After Training Quantization
Quantization refers to techniques for performing computations and storing tensors at lower bit widths than floating-point precision. A quantized model executes some or all of the operations on tensors with integers rather than floating-point values. Quantization of the network parameters normally takes place after training and usually leads to an accuracy loss. Running plain post-training quantization, the UNet PSNR is reduced to 27.761646 dB and the SSIM to 0.836058. Quantization of the UNet network is the second step in run_all.sh.
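The sketch below shows roughly what a Vitis-AI (pytorch_nndct) post-training quantization step looks like; `model`, `evaluate`, `calib_loader` and `val_loader` are placeholders, the project's real commands live in run_all.sh and the linked scripts.

```python
# Rough sketch of post-training quantization with the Vitis-AI PyTorch quantizer.
import torch
from pytorch_nndct.apis import torch_quantizer

dummy_input = torch.randn(1, 3, 256, 256)        # assumed input shape

# 1) Calibration: forward a few batches to collect activation statistics
quantizer = torch_quantizer("calib", model, (dummy_input,), output_dir="quantize_result")
quant_model = quantizer.quant_model
evaluate(quant_model, calib_loader)              # forward passes only
quantizer.export_quant_config()

# 2) Test: evaluate the quantized model (PSNR/SSIM) and export the xmodel for the DPU
quantizer = torch_quantizer("test", model, (dummy_input,), output_dir="quantize_result")
psnr, ssim = evaluate(quantizer.quant_model, val_loader)
quantizer.export_xmodel(output_dir="quantize_result", deploy_check=False)
```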
As we can see, the quantization results in a roughly 12 dB accuracy loss, so we need to improve the quantization results. Vitis-AI offers a "Fast Finetune" option to improve accuracy, a process based on the AdaQuant algorithm. With it, the network output is slightly better (PSNR: 28.352730 dB; SSIM: 0.838978). This fast fine-tune of the UNet network is the third step in run_all.sh.
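As a rough sketch with the same placeholders as above, the fast finetune adds one extra call to the calibration step:

```python
# Fast finetune (AdaQuant-based) sketch, reusing the placeholders from the previous snippet.
quantizer = torch_quantizer("calib", model, (dummy_input,), output_dir="quantize_result")
quant_model = quantizer.quant_model
quantizer.fast_finetune(evaluate, (quant_model, calib_loader))   # layer-wise fine-tuning
quantizer.export_quant_config()

# In "test" mode the fine-tuned parameters are loaded back before evaluation
quantizer = torch_quantizer("test", model, (dummy_input,), output_dir="quantize_result")
quantizer.load_ft_param()
```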
Both previous methods use the final trained floating-point network as input. The third method, described in this section, trains the network from scratch. The mechanism of quantization aware training (qat) is simple: it places quantization modules, i.e. quantization and dequantization modules, at the places where quantization happens during the floating-point to quantized integer model conversion, to simulate integer values. The fake quantization modules also monitor the scales and zero points of the weights and activations. Once the quantization aware training is finished, the floating-point model can be converted to a quantized integer model immediately using the information stored in the fake quantization modules. Compared to the other quantization techniques, qat trains the network from the bottom up.
To use qat we have to modify the network structure to enable the Xilinx QatProcessor. The QatProcessor automatically inserts all fake quantization layers and handles the conversion from floating-point to integer (Source). We modified the model in two major ways (see the sketch after the list):
- All quantizable operations must be instances of torch.nn.Module
- All layers must have a unique name
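The sketch below illustrates both points together with the basic QatProcessor flow; the wrapper module, shapes and names are illustrative, and the project's real model code is in the GitHub repo.

```python
# Sketch of the two model modifications and the basic QatProcessor flow.
import torch
import torch.nn as nn
from pytorch_nndct import QatProcessor   # Vitis-AI quantization-aware training

class Add(nn.Module):
    """Wrap a '+' (e.g. a skip connection) so it is a quantizable nn.Module."""
    def forward(self, a, b):
        return a + b

# Unique layer names: give every layer its own attribute instead of reusing one,
# e.g. self.relu1 = nn.ReLU() and self.relu2 = nn.ReLU() rather than one shared ReLU.

dummy_input = torch.randn(1, 3, 256, 256)              # assumed input shape
qat_processor = QatProcessor(model, (dummy_input,), bitwidth=8)
quant_model = qat_processor.trainable_model()           # model with fake-quant modules inserted
# ... train quant_model exactly like the floating-point model ...
deployable = qat_processor.to_deployable(quant_model, output_dir="qat_result")
```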
At this point it's important to double-check the floating-point model performance while preparing for qat, to ensure that the model still works correctly. The model source code can be found in the project's GitHub repo. I retrained the qat model with normal floating-point training and double-checked the output performance. For the UNet image restoration pipeline, quantization aware training improves the model output to a PSNR of 33.6874 dB and an SSIM of 0.925673. To run qat for UNet, use qat.py in run_all.sh.
Using qat for model training results in great accuracy compared to the floating-point model. Normal quantization and fast_finetune did not get the model parameters close to floating-point accuracy, but with qat we get much closer to the floating-point model output. The Vitis-AI Model Zoo is also trained with qat, so Xilinx has already done the qat training job for you.
The qat quantization bit width is eight bits. But if we want more throughput, resulting in more processed frames, we can go down to four bits. Reducing the bit width should result in a less accurate model, but reducing the model parameters to four bits results in a faster execution time. This step is optional because we only test the qat output performance with eight bits on the VCK5000 card. Changing the qat bit width is a simple modification of the input parameter for the QatProcessor (Source). Qat with four bits gives a PSNR of 20.8743 dB and an SSIM of 0.728075.
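Using the same placeholder setup as in the sketch above, the four-bit variant is only a different argument:

```python
# Same placeholders as above, only the bit width changes.
qat_processor = QatProcessor(model, (dummy_input,), bitwidth=4)
```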
Comparing the GPU performance with an FPGA is not as simple as it sounds. The inference tasks differ for a GPU: since we use an Nvidia RTX3090 GPU, the tasks are scheduled by software (CUDA). The underlying scheduler assigns tasks to the Tensor or CUDA cores. The scheduler also tries to optimize the data copy process from the main GPU memory to the local core memory to maximize core efficiency, but this is a whole different issue (Source, Source).
In general, data copying is a time-consuming task for a GPU or FPGA, especially copying data from the host memory to the device memory (this behavior is different on embedded devices). The RTX3090 GPU uses PCIe 4.0 x16, while the VCK5000 uses PCIe 3.0 x16. To level the data rate from the host memory to the PCIe devices, all PCIe lanes are configured to PCIe 3.0 via the BIOS. The PC system runs without a monitor to reduce external GPU load.
In the end, typical application requirements for an AI inference task can be:
- Energy consumption
- Process a specific number of frames per second at a specific model accuracy
- Reliable inference time and scalability
Important Note: The VCK5000 is processing a quantized UNet network while the GPU is processing a floating-point UNet network.
To measure the energy consumption during inference, a Hama power meter was used. The inference runs in a loop over 2000 input images and the output images are saved on the internal SSD. The GPU processes the trained floating-point model with CUDA support at batch size 1. The VCK5000 processes the qat model at batch size 1 with 8PE@350 MHz in Gen3x16 mode. The resulting performance is not the raw theoretical throughput, because the network output is checked and saved on the system SSD.
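For reference, the GPU-side measurement is conceptually a loop like the following sketch; the data loader, paths and helper names are placeholders for the project's actual scripts.

```python
# Rough sketch of the GPU-side measurement loop: batch size 1, 2000 images,
# every output written to the SSD (so this is not raw throughput).
import time
import torch
from torchvision.utils import save_image

model.eval().cuda()
num_images = 2000
start = time.time()
with torch.no_grad():
    for i, noisy in enumerate(test_loader):                  # noisy inputs, batch size 1
        restored = model(noisy.cuda()).clamp(0, 1)
        save_image(restored.cpu(), f"output/{i:05d}.png")     # saving is part of the measured time
        if i + 1 == num_images:
            break
fps = num_images / (time.time() - start)
print(f"average throughput: {fps:.2f} FPS")
```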
The power consumption is measured with only one device (GPU or FPGA) running. To minimize the influence of cooling on the power draw (Source), each device is measured in the same PCIe slot to give both devices the same environmental conditions. The room temperature was measured by a logger and was more or less constant at 19.5 degrees Celsius. All values (power, FPS) were captured by hand every 20 minutes.
The diagram shows that the Versal system is more power-efficient than the GPU. Its average power consumption is 80 watts lower than that of the GPU.
Inference FPS VCK5000 vs. RTX3090 GPU
Besides the power consumption, the processing performance was measured. The diagram shows that the GPU runs at nearly 18 FPS with some jitter, while the VCK5000 processes nearly 40 images per second. The standard deviation of the processed frames per second:
- GPU: 1.134 FPS
- Versal: 0.2344 FPS
This project is just a start for optimizing the UNet pipeline for accuracy and power efficiency. The following topics can be addressed as a follow-up to improve the processing pipeline:
Image preprocessing
At the moment the input image is preprocessed by the CPU to fit the network input. This task could easily be implemented on the Versal VCK5000.
Direct Versal storage
Loading the input image directly via PCIe DMA transfer into the VCK5000 memory reduces the CPU load for reading and writing images. Direct storage can be added on top of the image preprocessing, eliminating CPU load completely from the AI processing tasks. Microsoft implemented direct storage for GPUs (Source).
Batch processing and pipelining
The internal data flow on the VCK5000 can be optimized by pipelining image loading, preprocessing, AI inference and image storing. In an ideal world, four images are processed in different stages at the same time on the VCK5000.
Conclusion
The first part of the project is a state-of-the-art image restoration pipeline. The pipeline can be processed by a VCK5000 Versal accelerator card. Compared to other state-of-the-art networks, the pipeline, with a PSNR of 33.6874 dB and an SSIM of 0.925673, is ranked for both metrics in the TOP 15 (Source, Source) (date: 03/30/2022) networks processing the SIDD dataset.
On the other side, developers, system architects and everyone interested in FPGA inference can use this project as a starting point to check their AI inference requirements. This project helped to get a better understanding of how to meet the following requirements:
- Energy consumption
- Process a specific number of frames per second at a specific model accuracy
- Reliable inference time and scalability
The first part of this project shows in a simple way how to analyze a quantized PyTorch UNet network. This is done with three different methods (post-training quantization, fast-fine-tune quantization, quantization aware training). Quantization aware training generates the best output accuracy for the UNet model, with a PSNR of 33.6874 dB. Compared to the floating-point model, the PSNR is 6 dB lower.
In the second part of the project, the power consumption of the model computation on the Versal card is compared to a GPU. The power consumption is in general 80 watts lower than that needed by the GPU. Assuming a 24/7 workload, the Versal VCK5000 saves 80 W x 24 h = 1.92 kWh per day, 13.44 kWh per week and 683 kWh per year.
The third part of the project compares the processing performance between the Versal and a GPU. Processing the quantized network on the FPGA leads to 100% more fps. Besides processing more frames, the Versal pipeline has a more constant processing flow in terms of frames per second: the standard deviation over the three-hour inference run is 0.2344 FPS.
Closing the circle from the beginning: AI models help to solve fundamental problems like crop failures, food waste, traffic steering, etc., and these models can be computed by FPGAs. AI inference based on FPGA accelerators is extremely efficient, so FPGAs can help to reduce the global energy needs of computing systems and save a lot of resources.