For model generation, a laptop with a Xeon E3-1505M processor, a Quadro M2200 GPU with 4 GB of memory, and 32 GB of system RAM running Windows 10 is used as the training computer. In addition, TensorFlow-GPU 1.12 is used in an Anaconda environment.
The inference computer is a desktop with a Ryzen 7 5800X on an X570 motherboard, the VCK5000 (with its 16 GB of off-chip memory) in the x16 PCIEX16_1 slot, a FirePro W2100 in PCIEX16_3 for the graphical interface, and an SSD. The computer has 32 GB of system RAM and runs Ubuntu 20.04.
The Seed Card
Acquiring information to analyze and train AI algorithms is the first challenge. In the present development, 60 seed images of 6 kinds of plants were taken on a grid card for a UNet architecture. The card consists of 1.5 cm squares, with one seed placed in each square. Each row has 10 cells for one kind of seed and is identified with a letter from A to F for labeling purposes.
The image above was acquired with an Arducam C lens with manual focus and aperture adjustment, mounted on a camera that stores images at a resolution of 1920x1080 pixels. In addition, the seed card was placed in a box to minimize the influence of external light. The light source was a polarized ring, and the optics include a polarizing crystal to adjust the angle.
Given the constraints of the training computer, the acquired images must be resized; consequently, the seeds could be deformed and details would be lost in the preprocessing stage. To reduce this kind of issue on a low-resource computing system, an image was acquired for each seed on the card with a 100x lens.
Consider the facedetect demo: this application uses the DenseBox architecture with a 320x320 pixel resolution.
Although the reduced resolution could lose detail, the seeds acquired with the 100x lens preserve this information in the images.
In order to scale the project in the future, a small assumption is made about the image resolution: the camera's form factor is 16:9, so any scale that preserves this form factor can be used. A scale factor that yields an integer width and height should be used so the image can be subsampled without modifying pixels. In addition, the scale adjusts both the memory required for training and the detail level present in the training process. For the present work, a scale of 6 was selected, giving 320x180 pixels.
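As an illustrative sketch (this helper is not part of the original toolchain), the valid scale factors for a 1920x1080 frame can be enumerated in Python:

# List the integer scales that keep both dimensions integer,
# so the image can be subsampled without modifying pixels.
WIDTH, HEIGHT = 1920, 1080  # camera resolution, 16:9 form factor

for scale in range(1, HEIGHT + 1):
    if WIDTH % scale == 0 and HEIGHT % scale == 0:
        print(f"scale {scale}: {WIDTH // scale}x{HEIGHT // scale}")
# scale 6 yields 320x180, the size selected for this work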
Image Segmentation (Inference)
The architecture used is UNet. The original base architecture is the following:
CNN-based architectures present some challenges. The first is the computational power required. The second is accelerating the training process, which commonly requires a GPU with enough memory to store the information. Consequently, the original architecture is modified to accept 320x180 pixel (scale 6) images and uses a padding parameter so the output has the same dimensions as the input.
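As an illustration only, and not the exact model from the Code section, a minimal Keras sketch of a UNet-style network with 'same' padding and the 320x180 input could look like this:

import tensorflow as tf
from tensorflow.keras import layers, Model

def minimal_unet(h=180, w=320, ch=3):
    inputs = layers.Input((h, w, ch))
    # Encoder: two downsampling stages (180x320 -> 90x160 -> 45x80)
    c1 = layers.Conv2D(16, 3, activation='relu', padding='same')(inputs)
    p1 = layers.MaxPooling2D()(c1)
    c2 = layers.Conv2D(32, 3, activation='relu', padding='same')(p1)
    p2 = layers.MaxPooling2D()(c2)
    # Bottleneck
    b = layers.Conv2D(64, 3, activation='relu', padding='same')(p2)
    # Decoder with skip connections back to the input resolution
    u2 = layers.UpSampling2D()(b)
    c3 = layers.Conv2D(32, 3, activation='relu', padding='same')(
        layers.concatenate([u2, c2]))
    u1 = layers.UpSampling2D()(c3)
    c4 = layers.Conv2D(16, 3, activation='relu', padding='same')(
        layers.concatenate([u1, c1]))
    # 1x1 convolution produces a per-pixel seed probability mask
    outputs = layers.Conv2D(1, 1, activation='sigmoid')(c4)
    return Model(inputs, outputs)

Thanks to the 'same' padding, the output mask has exactly the same 320x180 dimensions as the input image.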
AI is a powerful tool, but if the training process does not have adequate data, performance tends to degrade. Moreover, the architecture is adjusted to the available GPU/CPU computing resources, which reduces the inference metrics and can isolate zones that do not belong to the seed or to another Region of Interest Under Test (RUT). Look at the next example:
As a final product, we obtain a .png file that encodes the segmentation in the alpha channel, while the information in the other channels remains identical to the original image.
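A minimal sketch of this encoding, assuming a Pillow/NumPy workflow that may differ from the original script:

import numpy as np
from PIL import Image

def save_segmentation_png(rgb_path, mask, out_path):
    # mask: 2-D array in [0, 1] produced by the UNet
    rgb = np.array(Image.open(rgb_path).convert('RGB'))
    alpha = (mask * 255).astype(np.uint8)  # segmentation -> alpha channel
    rgba = np.dstack([rgb, alpha])         # RGB channels stay untouched
    Image.fromarray(rgba, 'RGBA').save(out_path)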
ROI isolation and characterization (Post-processing)
Some regions in the image are not valid for the instance of interest. To remove those invalid regions, each one is characterized by statistical features. This allows detecting the region nearest to the image center and extracting its features:
In addition, the algorithms reorient and trim the seed to reduce the information and isolate the corresponding seed in the acquired image. In the next example, the isolated RUT was resized to the original image dimensions, the seed was isolated, and the image was annotated with the statistical features.
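A hedged sketch of this region-selection idea, using OpenCV connected components (the original algorithm may differ in its features and selection rule):

import cv2
import numpy as np

def keep_center_region(mask):
    # mask: binary uint8 image (0/255) from the thresholded segmentation
    n, labels, stats, centroids = cv2.connectedComponentsWithStats(mask)
    if n < 2:  # label 0 is the background; no foreground region found
        return None, {}
    h, w = mask.shape
    # Choose the region whose centroid is nearest to the image center
    dists = np.linalg.norm(centroids[1:] - np.array([w / 2, h / 2]), axis=1)
    best = 1 + int(np.argmin(dists))
    x, y, bw, bh, area = stats[best]
    features = {'area': int(area),
                'bbox': (int(x), int(y), int(bw), int(bh)),
                'centroid': tuple(centroids[best])}
    roi = np.where(labels == best, mask, 0)[y:y + bh, x:x + bw]
    return roi, features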
As the final product, we obtain two .png files, with the annotated image and the isolated seed respectively. Optionally, an auxiliary file is saved with the seed features.
The process
The first step is to take each acquired image and manually segment the instances that are valid for the study of interest. The GIMP software was used for this purpose. It is necessary to teach the algorithms what our objective is, and the seed segmentation is the language that the UNet understands. Originally, the images carry no information about their meaning, so a user must provide it. The segmentation must be done for every acquired image to obtain the largest amount of information for the algorithm.
The second stage is to define the UNet; you can consult the model implemented in the Code section. This model has an input size of 320x180 pixels for color images to avoid GPU memory overflow. On the training computer, the model requires about 35 seconds per epoch. The product of this stage is a .hdf5 file containing all the hyper-parameters of the designed model. The file is stored as archA_epE_hxwD.hdf5, where arch, ep, h, and w are codified as:
- arch: the architecture model implemented; 2 for the minimal UNet, 1 for the medium UNet, and 0 for the unmodified architecture.
- ep: the number of epochs used in the training process.
- h, w: the model input size.
Although the inference computer is more powerful, the training process is performed on the laptop because it has an NVIDIA board as the accelerator. Furthermore, the freeze process is performed on that computer too.
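For reference, the standard TensorFlow 1.x freeze flow used by the Vitis-AI tutorials looks roughly like this (the .hdf5 filename is hypothetical):

import tensorflow as tf
from tensorflow.keras import backend as K
from tensorflow.python.framework import graph_util, graph_io

K.set_learning_phase(0)  # force inference mode before loading the model
model = tf.keras.models.load_model('arch2_ep100_180x320.hdf5')  # hypothetical name
sess = K.get_session()
# Convert variables to constants, keeping only the inference subgraph
frozen = graph_util.convert_variables_to_constants(
    sess, sess.graph.as_graph_def(),
    [out.op.name for out in model.outputs])
graph_io.write_graph(frozen, '.', 'frozen_model.pb', as_text=False)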
The acquisition, manual segmentation, model quantization, model compiling, and application are performed on the inference computer. The last three stages are executed using the Vitis-AI toolset in Docker. Before starting, the environment needs to be adapted to support all the frameworks. This work is based on the FCN8 and UNET Semantic Segmentation with Keras and Xilinx Vitis AI and PyTorch flow for Vitis AI tutorials. Remember: Vitis-AI uses TensorFlow 1.15 as one of its environments.
If you are using a graphics card with a DisplayPort connection, you may need to execute the following:
export DISPLAY=:0.0
You may wonder how to compile the model for your board. The examples have information for the ZCU boards or the VCK190, but what about Kria, VCK5000, and others? You can execute the command below to see all the supported DPUs for the available boards:
tree /opt/vitis_ai/compiler/arch/
If you do not remember the hyper-parameters of the model, you can use vai_q_tensorflow to obtain this information:
vai_q_tensorflow inspect --input_frozen_graph MODEL.pb
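With the node names reported by the inspection and the arch.json located with the tree command above, the quantize and compile steps might look roughly like the following; the node names, shapes, and calibration function are illustrative and must be replaced with the values of your own model:

vai_q_tensorflow quantize \
    --input_frozen_graph frozen_model.pb \
    --input_nodes input_1 \
    --input_shapes ?,180,320,3 \
    --output_nodes conv2d_19/Sigmoid \
    --input_fn input_fn.calib_input \
    --calib_iter 10

vai_c_tensorflow \
    --frozen_pb quantize_results/quantize_eval_model.pb \
    --arch /opt/vitis_ai/compiler/arch/DPUCVDX8H/VCK5000/arch.json \
    --output_dir compiled \
    --net_name unet_seed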
Performance
On the training computer, the inference time for a single image with the trained model is about 153 ms at a 160x360 pixel image resolution. It is important to compare both model variants, the frozen TensorFlow model and the compiled model for VCK5000, under the same conditions. In addition, it is convenient to use the same language, Python, and a single thread.
The TensorFlow script for multiple-image input to the model can be consulted in the Code section. Note that the measured time covers only the tensor processing on the GPU.
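A minimal sketch of this measurement idea, again with a hypothetical model filename (the actual script is in the Code section):

import time
import numpy as np
import tensorflow as tf

model = tf.keras.models.load_model('arch2_ep100_180x320.hdf5')  # hypothetical name
batch = np.random.rand(13, 180, 320, 3).astype(np.float32)  # 13-image bundle
model.predict(batch[:1])   # warm-up so graph setup is excluded from the timing
start = time.perf_counter()
model.predict(batch)       # measure only the tensor processing
elapsed = time.perf_counter() - start
print(f"{len(batch) / elapsed:.1f} FPS")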
The test uses two trimmed datasets, one with 13 images and the other with 47 images; together they make up the full dataset. From the table above, an improvement is observed for Vitis AI compared with TensorFlow. Unfortunately, the GPU in the training computer does not have enough memory to allocate the 47 images, so single-thread inference over that whole bundle at once is not possible. On the other hand, the inference computer can handle those images, which allows a throughput of 629 FPS. To compare both computers, the 13-image bundle can be allocated on both systems: TensorFlow achieved a throughput of 7 FPS, while Vitis-AI improved the processing to 480.05 FPS.
Conclusions
A reduction in processing time is observed in the graph above. The data center inference approach requires about 1.5% of the time consumed on the laptop, which represents an improvement of about 67x.
Both systems require a delay to load the architecture before starting to process. Vitis-AI requires more time to load the .xmodel than TensorFlow does for the .pb, but its inference computing time is far lower than that required for the .pb.
For continuous work and use of an inference service in data centers, we can afford more time to load the model in exchange for heavy acceleration. In production environments, users are not tolerant of long computing times; consequently, I believe the VCK5000 is a great product to improve the throughput of our AI implementations.
What's next?
There are more inference challenges in this project. Segmentation is the most popular stage of the processing chain. Now we know where the seed is, how it is visualized, and what the meaningful information is. But other branches are involved:
- What kind of seed am I seeing?
- Does the seed have an adequate form factor?
- According to the World Intellectual Property Organization (WIPO), does the seed meet the distinctness, uniformity, and stability criteria?
Seeds need to meet quality standards, and if the vegetative material is to be protected by organizations for commercial purposes, it must satisfy a registration rubric. The digital approach enables information registration and traceability without the damage of aging in long-term storage media, sharing results over a network, and post-processing for new evaluation methods.
Xilinx, as a cloud solutions provider, improves information processing by reducing the processing time and centralizing an organization's inference requests. On-premise products such as the VCK5000 or C1100, and cloud provider services like the F1 instances of Amazon Web Services (AWS), represent a continuous improvement in high-performance computing (HPC).