MicroZed Chronicles: Creating Video Streams with Vivado HLS
How to create a custom image processing HLS block using Vivado HLS.
One of the great things about programmable logic is its ability to implement image processing pipelines in parallel. This means we can obtain a higher-performing image processing system.
Throughout the years of this blog we have looked at the image processing IP that comes with Vivado, Vivado HLS, and of course, the Xilinx XFOpenCV libraries.
However, what happens if we need to develop our own image processing block using Vivado HLS?
Creating a custom IP block is pretty simple. The first thing we need to do is define the inputs and outputs. To be able to integrate them with our existing design and IP, we need to use AXI Streams for interfacing.
The best way to get started is to define our own video stream. Within this stream, we include:
- tLast and tUser — Sideband signals used to indicate the end of a line (tLast) and the start of a new frame (tUser).
- Pixels — A single pixel or multiple pixels; each pixel must also be configured for the number of channels it contains.
To define this video stream, we can use several features from the Vivado HLS library.
The first of these is the ap_int.h library, which allows us to define arbitrary-precision variables.
We can use this arbitrary precision to define the size of each pixel element.
For example, if we have a 24-bit red, green, blue pixel, we can define each color channel as using 8 bits.
If we are working with a greyscale color space instead, this could be defined as 12 or 16 bits depending on the output of the camera.
#include "ap_int.h"
typedef ap_uint<8> pixel_type;
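If we were instead working with a 12-bit greyscale camera, a hypothetical alternative definition (my own illustration, not used in the example below) could be:
typedef ap_uint<12> grey_pixel_type; // single 12-bit greyscale channel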
With the pixel type defined, we are able to create the rest of the video stream structure for our interfacing.
struct video_stream {
   struct {
      pixel_type p1;
      pixel_type p2;
      pixel_type p3;
   } data;
   ap_uint<1> user;
   ap_uint<1> last;
};
Defining this structure enables us to create the input and output streams of our HLS module.
In this example module, I am going to show how we can read in pixels, store them in a buffer for two samples, and then output the data.
While this is a simple example, it demonstrates everything we need in order to create more complex blocks as required for our application.
Let's take a look in more detail:
#define delay 2 // depth of the buffer in samples (two in this example)

void line_convert(video_stream* stream_in, video_stream* stream_out){
#pragma HLS CLOCK domain=default
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS INTERFACE axis port=stream_in
#pragma HLS INTERFACE axis port=stream_out
   ap_uint<48> buffer;
   ap_uint<delay> has_last;
   ap_uint<delay> has_user;
   bool last = false;
   bool delayed_last = false;
The video_stream structure previously created is used for the input and output arguments in the function call.
To make sure these are implemented as AXI Stream interfaces, we can use HLS pragmas to define stream_in and stream_out as AXI Streams. I also use an HLS pragma on the function return (ap_ctrl_none) to remove the block-level control interface so the core is free-running.
To buffer the data required, I declare three ap_uint variables of differing precision.
Pixel values are stored in one long vector called buffer.
The vector length is determined by the delay we desire (in this case two) and the size of the element to be buffered. As a pixel is 24 bits long, we need a 48-bit vector to delay the stream by two samples.
As the tLast and tUser sideband signals are single bits, two elements are required for a delay of two.
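As a sketch of how these declarations could be parameterized (my own generalization, shown only as an aside rather than part of the function; PIXEL_BITS is a hypothetical name):
#define PIXEL_BITS 24 // three 8-bit channels per pixel
ap_uint<PIXEL_BITS * delay> buffer;   // 48 bits for two 24-bit pixels
ap_uint<delay>              has_last; // one flag bit per buffered sample
ap_uint<delay>              has_user;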
   while (!delayed_last) {
#pragma HLS pipeline II=2
      delayed_last = last;
      for (int j = 0; j < delay; ++j) { // buffer in two pixels
         if (!last) {
            buffer.range(j*24 + 7,  j*24)      = stream_in->data.p1;
            buffer.range(j*24 + 15, j*24 + 8)  = stream_in->data.p2;
            buffer.range(j*24 + 23, j*24 + 16) = stream_in->data.p3;
            has_user[j] = stream_in->user;
            has_last[j] = stream_in->last;
            last = stream_in->last;
            ++stream_in;
         }
      }
      if (!delayed_last) {
         for (int i = 0; i < delay; ++i) { // write out two pixels
            stream_out->data.p1 = buffer.range(i*24 + 7,  i*24);
            stream_out->data.p2 = buffer.range(i*24 + 15, i*24 + 8);
            stream_out->data.p3 = buffer.range(i*24 + 23, i*24 + 16);
            stream_out->user = has_user[i];
            stream_out->last = has_last[i];
            ++stream_out;
         }
      }
   }
}
The actual body of the code loops until it sees the delayed_last signal. This is the buffered tLast signal, which ensures all of the buffered data has been output.
The code then uses one loop to read input samples into the buffer and another loop to output the delayed data.
Both of these loops are under the control of the respective last signal, either the incoming tLast or buffered tLast.
The pixels are read into or written out of the buffer using the buffer's range() method. This allows us to select the MSB and LSB we wish to work with within the overall buffer.
If we wanted to implement processing functions, we could do this in place of the simple buffering, as sketched below.
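As a hypothetical sketch (my own illustration, not part of the original design), the output loop could, for example, invert each 8-bit channel as it is written out:
for (int i = 0; i < delay; ++i) { // example: invert each colour channel
   pixel_type p1 = buffer.range(i*24 + 7,  i*24);
   pixel_type p2 = buffer.range(i*24 + 15, i*24 + 8);
   pixel_type p3 = buffer.range(i*24 + 23, i*24 + 16);
   stream_out->data.p1 = 255 - p1;
   stream_out->data.p2 = 255 - p2;
   stream_out->data.p3 = 255 - p3;
   stream_out->user = has_user[i];
   stream_out->last = has_last[i];
   ++stream_out;
}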
To ensure that the performance is acceptable when we synthesize this for the target hardware, I use an HLS pragma to define the initiation interval.
The initiation interval is the number of clocks required before the loop can take a new input. In this case, as we are buffering two samples, we should define this as two.
The results of the synthesis show the desired initiation interval is achieved. They also show that the logic resources required for this simple buffering function are minimal.
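Before exporting the IP, it is also worth checking the behaviour in C simulation. A minimal sketch of a test bench of my own (not from the original design; the header name line_convert.h and frame width are hypothetical) might look like:
// Build one line of test pixels, run line_convert, and check the data passes through.
#include <cstdio>
#include "line_convert.h" // assumed header declaring video_stream and line_convert

#define WIDTH 8 // hypothetical short test line

int main() {
   video_stream in[WIDTH];
   video_stream out[WIDTH];

   for (int x = 0; x < WIDTH; x++) {
      in[x].data.p1 = x;            // simple ramp test pattern
      in[x].data.p2 = x + 1;
      in[x].data.p3 = x + 2;
      in[x].user = (x == 0);        // tUser marks the start of the frame
      in[x].last = (x == WIDTH - 1);// tLast marks the end of the line
   }

   line_convert(in, out);

   int errors = 0;
   for (int x = 0; x < WIDTH; x++) {
      if (out[x].data.p1 != in[x].data.p1 ||
          out[x].data.p2 != in[x].data.p2 ||
          out[x].data.p3 != in[x].data.p3)
         errors++;
   }
   printf("%s: %d mismatches\n", errors ? "FAIL" : "PASS", errors);
   return errors;
}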
Now we know how to create an HLS IP core that works with AXI Streams exactly as we need to accelerate our image processing platforms.
See My FPGA / SoC Projects: Adam Taylor on Hackster.io
Get the Code: ATaylorCEngFIET (Adam Taylor)
Access the MicroZed Chronicles Archives with over 300 articles on the FPGA / Zynq / Zynq MPSoC updated weekly at MicroZed Chronicles.