MicroZed Chronicles: HLS and High Speed Imaging
Tips and tricks for creating a two-dimensional high-performance image processing IP core.
Over the last few blogs (P1, P2, P3), we have looked in depth at High-Level Synthesis (HLS) and its use in image processing.
HLS provides real advantages for image processing as it allows us to focus on our algorithm. We can also achieve very high frame rates when working with HLS, with a little thought about the optimizations we apply.
A few weeks ago we looked at reading a line of data from DDR memory such that we could create a simple test pattern.
For many applications, however, we want the ability to inject a two-dimensional image into the image processing stream. This gives us the ability to test our image processing algorithms' performance using synthetic images.
Creating the ability to read out a two-dimensional image from DDR can be both straightforward and challenging. So in this blog, I am going to explain the intricacies, along with the optimizations that can be used to achieve high frame rates (> 500 fps).
Fetching a two-dimensional image from DDR also uses the memcpy() function, although this time set to copy all of the pixels in the frame. Of course, to fetch the image from DDR the processing system must have stored the image in contiguous memory locations, with each line following the previous line.
A low-performance solution would read a line from DDR and then output it. This can be improved by buffering lines; however, to achieve the highest throughput, the entire image must first be loaded into on-chip memory.
If we are using a device which is rich in UltraRAM, we can use the UltraRAM to store the pixels extracted from DDR. Thanks to the sequential nature of the pixels in the image, this makes the read-out from UltraRAM or BRAM very simple, aiding the performance.
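As a minimal sketch of this buffering step (the function name and reduced image dimensions are illustrative, and the `BIND_STORAGE` pragma follows Vitis HLS conventions — older Vivado HLS releases use a `RESOURCE` pragma instead; an ordinary C++ compiler simply ignores the unknown pragma):

```cpp
#include <cstdint>
#include <cstring>

#define MAX_WIDTH  64   // reduced for illustration; the article uses larger frames
#define MAX_HEIGHT 32

typedef uint16_t image_in;

// Hypothetical fragment: buffer the incoming frame in UltraRAM and
// read it back out sequentially.
void buffer_frame(const image_in *image, image_in *out) {
    static image_in frame[MAX_WIDTH * MAX_HEIGHT];
    // Ask the HLS tool to implement the buffer in UltraRAM:
    #pragma HLS BIND_STORAGE variable=frame type=RAM_2P impl=URAM
    memcpy(frame, image, sizeof(frame));      // one burst read from DDR
    for (int i = 0; i < MAX_WIDTH * MAX_HEIGHT; i++) {
        out[i] = frame[i];                    // simple sequential read-out
    }
}
```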
typedef uint16_t image_in;
image_in frame[MAX_WIDTH*MAX_HEIGHT];
memcpy(frame,image,(MAX_WIDTH*MAX_HEIGHT)*sizeof(image_in));
The pixel data can then be read out of the UltraRAM using nested loops, as before.
The address of the pixel to be read out is calculated simply from the line count, the pixel count, and the line length. For example:
outer_loop: for (y = 0; y < lines; y++) {
    inner_loop: for (x = 0; x < MAX_WIDTH; x++) {
        tpg_gen.data = frame[(y*MAX_WIDTH)+x];
    }
}
To increase the throughput without significantly increasing the resource utilization, the inner loop can be pipelined with an initiation interval of one.
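A compilable sketch of the pipelined read-out is shown below. The `tpg_gen` stream is replaced here by a plain output array so the loop logic stands alone, and the `PIPELINE` pragma follows Vitis HLS conventions (a plain C++ compiler ignores it):

```cpp
#include <cstdint>

#define MAX_WIDTH  64   // reduced for illustration
#define MAX_HEIGHT 32

typedef uint16_t image_in;

// Read the buffered frame out in raster order; in HLS the inner loop is
// pipelined with an initiation interval of one, so one pixel is produced
// per clock cycle once the pipeline fills.
void read_out(const image_in frame[MAX_WIDTH * MAX_HEIGHT],
              image_in *out, int lines) {
    outer_loop: for (int y = 0; y < lines; y++) {
        inner_loop: for (int x = 0; x < MAX_WIDTH; x++) {
            #pragma HLS PIPELINE II=1
            out[(y * MAX_WIDTH) + x] = frame[(y * MAX_WIDTH) + x];
        }
    }
}
```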
Using UltraRAM and pipelining the inner loop enables high frame rates to be achieved. The structure of the code is also important, as we do not want the memcpy() command to run each time the HLS IP block runs; reading from DDR every time would significantly impact the performance.
To address this, I used an AXI Lite register to control whether or not the video image is loaded.
if (load == 1) {
    memcpy(frame, image, (MAX_WIDTH*MAX_HEIGHT)*sizeof(image_in));
}
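Putting the pieces together, a sketch of this on-demand loading might look as follows. The function name, argument order, and reduced dimensions are illustrative; the stream output is simplified to a plain array, and the interface pragma follows Vitis HLS conventions for mapping `load` to an AXI Lite register:

```cpp
#include <cstdint>
#include <cstring>

#define MAX_WIDTH  64   // reduced for illustration
#define MAX_HEIGHT 32

typedef uint16_t image_in;

// Hypothetical top level: the frame is only fetched from DDR when the
// processor sets the 'load' register; otherwise the stored copy is replayed.
void tpg(const image_in *image, image_in *out, int lines, int load) {
    #pragma HLS INTERFACE s_axilite port=load
    static image_in frame[MAX_WIDTH * MAX_HEIGHT];
    if (load == 1) {
        memcpy(frame, image, sizeof(frame));  // expensive DDR read, on demand
    }
    for (int y = 0; y < lines; y++) {
        for (int x = 0; x < MAX_WIDTH; x++) {
            out[(y * MAX_WIDTH) + x] = frame[(y * MAX_WIDTH) + x];
        }
    }
}
```

Because `frame` is static, it retains its contents between invocations, so subsequent frames can be output without touching DDR at all.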
When I was experimenting with this in hardware, the difference between performing the DDR read every time the IP block runs and performing it only on demand was a factor of two, e.g. 560 frames per second compared to 1120 FPS.
Another challenge that presents itself when working with large arrays is simulation. The image requires a large amount of storage, depending upon the resolution of the image and the number of bits in each pixel.
int main (int argc, char** argv) {
    IplImage* dst;
    axis dst_axi;
    int y;

    // Declared static so the large array is not allocated on the stack
    static image_in data[MAX_WIDTH*MAX_HEIGHT];

    dst = cvCreateImage(cvSize(MAX_WIDTH, MAX_HEIGHT), IPL_DEPTH_16U, 1);

    // Fill the first ten lines with a horizontal ramp, the remainder with a constant
    for (y = 0; y < MAX_WIDTH*MAX_HEIGHT; y++) {
        if (y < MAX_WIDTH*10) {
            data[y] = y % MAX_WIDTH;
        } else {
            data[y] = 31;
        }
    }

    tpg(dst_axi, MAX_HEIGHT, 1, data);
    AXIvideo2IplImage(dst_axi, dst);
    cvSaveImage("op.bmp", dst);
    cvReleaseImage(&dst);

    return 0;
}
Failure to handle this correctly in the test bench will result in the simulation failing with a SIGSEGV error, also known as a segmentation fault. This comes from the compiler trying to allocate the storage on the stack.
To avoid this issue, we need to allocate the storage on the heap. We can achieve this using one of two approaches:
- Use the malloc() function
- Declare the variables as static
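The two approaches can be illustrated in a plain C/C++ test bench as follows (the names and helper function here are just examples):

```cpp
#include <cstdint>
#include <cstdlib>

#define MAX_WIDTH  640
#define MAX_HEIGHT 512

typedef uint16_t image_in;

// Option 1: static storage -- the array lives in the data segment,
// not on main()'s stack, so no stack overflow occurs.
static image_in data_static[MAX_WIDTH * MAX_HEIGHT];

// Option 2: heap storage via malloc(); the caller must free() the result.
image_in *alloc_frame(void) {
    return (image_in *)malloc(MAX_WIDTH * MAX_HEIGHT * sizeof(image_in));
}
```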
Once the storage is allocated on the heap, we are able to run both the C and co-simulations to ensure the performance is as we desire.
Of course, running it in the hardware is the real proof. Working with a 640 x 512, 16-bit-per-pixel greyscale image, the ZCU104 (ZU7EV) was able to achieve greater than 1100 frames per second when monitoring the TUser line, which indicates the start of a new frame.
This shows how we can implement an HLS high-performance image processing solution that can be used for testing without too much trouble!
See My FPGA / SoC Projects: Adam Taylor on Hackster.io
Get the Code: ATaylorCEngFIET (Adam Taylor)
Access the MicroZed Chronicles Archives with over 300 articles on the FPGA / Zynq / Zynq MPSoC updated weekly at MicroZed Chronicles.