MicroZed Chronicles: Focusing on HLS Timing and Initiation Interval Violations
How to focus in on the initiation interval and timing violations in Vivado HLS.
As you will be aware, I do a lot of High-Level Synthesis (HLS) design for clients, especially for image processing applications. One of the great things about HLS is the productivity it brings when creating the application and its verification.
However, when the performance of our HLS block is not as expected, being able to find the critical path that causes the violation is crucial. We have looked before at the analysis view and the potential optimizations that can be used to increase performance.
In this blog, we are going to examine how we can focus on finding the timing and initiation interval violations within our HLS designs and, of course, correcting them.
Let’s take a simple example of a Test Pattern Generator (TPG). The custom TPG will load an image over the S AXI link from the PS of a Zynq device and then output this image at very fast frame rates. Such an approach is often used to verify image processing algorithms. The image, once loaded, is stored in BRAM or URAM, depending upon the device selected for implementation. In order to achieve the high frame rates required, we are going to output multiple pixels per clock.
This code can be written simply: a memcpy reads the image in from the PS DDR, and two nested for loops output the data in the correct order.
Source Code
#include "Blog_352.h"
#include "ap_utils.h"

void tpg(axis& OUTPUT_STREAM,
         int lines,
         int pixels,
         int line_start,
         int pixel_start,
         image_in* image) {
#pragma HLS INTERFACE m_axi depth=327680 port=image offset=slave bundle=image
#pragma HLS INTERFACE axis register both port=OUTPUT_STREAM
#pragma HLS INTERFACE s_axilite port=return
#pragma HLS INTERFACE s_axilite port=lines
#pragma HLS INTERFACE s_axilite port=pixels
#pragma HLS INTERFACE s_axilite port=line_start
#pragma HLS INTERFACE s_axilite port=pixel_start

    // Copy the image from PS DDR into on-chip memory (BRAM/URAM)
    image_in frame[MAX_WIDTH * MAX_HEIGHT];
    memcpy(frame, image, (MAX_WIDTH * MAX_HEIGHT) * sizeof(image_in));

    tpg_gen(OUTPUT_STREAM, lines, pixels, line_start, pixel_start, frame);
}
void tpg_gen(axis& OUTPUT_STREAM,
             int lines,
             int pixels,
             int y_start,
             int x_start,
             image_in* frame) {
    VIDEO_COMP tpg_gen;
    int y = 0;
    int x = 0;

    outer_loop: for (y = 0; y < lines; y++) {
#pragma HLS LOOP_TRIPCOUNT max=513
        inner_loop: for (x = 0; x < pixels; x += 2) {
#pragma HLS LOOP_TRIPCOUNT max=640
#pragma HLS PIPELINE II=1
            if (y == 0 && x == 0) {
                // Start of frame: assert TUSER
                tpg_gen.user = 1;
                tpg_gen.last = 0;
                tpg_gen.data = ((frame[((y + y_start) * MAX_WIDTH) + ((x + x_start) + 1)] << 16) |
                                frame[((y + y_start) * MAX_WIDTH) + (x + x_start)]);
            } else if (x == (pixels - 2)) {
                // End of line: assert TLAST
                tpg_gen.user = 0;
                tpg_gen.last = 1;
                tpg_gen.data = ((frame[((y + y_start) * MAX_WIDTH) + ((x + x_start) + 1)] << 16) |
                                frame[((y + y_start) * MAX_WIDTH) + (x + x_start)]);
            } else {
                tpg_gen.last = 0;
                tpg_gen.user = 0;
                tpg_gen.data = ((frame[((y + y_start) * MAX_WIDTH) + ((x + x_start) + 1)] << 16) |
                                frame[((y + y_start) * MAX_WIDTH) + (x + x_start)]);
            }
            OUTPUT_STREAM.write(tpg_gen);
        }
    }
}
Header File
#include "hls_video.h"
#include <ap_fixed.h>
#include <string.h>
#include <stdint.h>

#define MAX_WIDTH 640
#define MAX_HEIGHT 513
#define WIDTH 32

typedef hls::stream<ap_axiu<WIDTH,1,1,1> > axis;
typedef ap_axiu<WIDTH,1,1,1> VIDEO_COMP;
typedef uint16_t image_in;

void tpg(axis& OUTPUT_STREAM,
         int lines,
         int pixels,
         int line_start,
         int pixel_start,
         image_in* image);

void tpg_gen(axis& OUTPUT_STREAM,
             int lines,
             int pixels,
             int y_start,
             int x_start,
             image_in* frame);
Test Bench
#include "Blog_352.h"
#include <hls_opencv.h>

using namespace std;

int main(int argc, char** argv) {
    IplImage* dst;
    axis dst_axi;
    int y;

    // Fill the input image with a horizontal ramp test pattern
    static image_in data[MAX_WIDTH * (MAX_HEIGHT - 1)];
    for (y = 0; y < MAX_WIDTH * (MAX_HEIGHT - 1); y++) {
        data[y] = y % MAX_WIDTH;
    }

    int pix = 640;
    int line = 513;

    // Two 16-bit pixels are packed per 32-bit stream word, hence pix/2
    dst = cvCreateImage(cvSize(pix / 2, line), IPL_DEPTH_16U, 1);

    tpg(dst_axi, line, pix, 0, 0, data);
    AXIvideo2IplImage(dst_axi, dst);
    cvSaveImage("op.bmp", dst);
    cvReleaseImage(&dst);

    return 0;
}
However, the desire to output two (or more) pixels per clock creates a bottleneck when reading from the array that stores the image.
This bottleneck exists because the image is stored as one 16-bit word per memory location. Outputting two pixels therefore requires reading two memory locations, which cannot be achieved in one clock cycle unless we correctly partition the BRAM.
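To see exactly why two reads per cycle are needed, consider how the inner loop packs its output word. The following is a plain C++ model of that packing (no HLS types); the helper name `pack_two_pixels` is illustrative, not part of the original source:

```cpp
#include <cstdint>

// Plain C++ model of the pixel packing used in the inner loop:
// two adjacent 16-bit pixels combine into one 32-bit AXI Stream word.
// Both frame[index] and frame[index + 1] must be read in the SAME
// cycle to sustain an initiation interval (II) of one.
static uint32_t pack_two_pixels(const uint16_t* frame, int index) {
    // Upper half-word carries the second pixel, lower half-word the
    // first, mirroring (frame[i+1] << 16) | frame[i] in the HLS code.
    return (static_cast<uint32_t>(frame[index + 1]) << 16) | frame[index];
}
```

With both pixels living in the same single memory, those two reads serialize over two cycles, which is precisely the II violation the analysis view reports.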
When we open the analysis view, we are presented with information in the module hierarchy indicating which module, if any, has a timing or initiation interval violation.
If we only want to focus on the violations, we can click on the timing or II violation button at the top of the module hierarchy.
As it stands, our design indicates an II violation in the tpg_gen function, which is the core function that reads the memory and outputs the data over the AXI Stream.
At this point, we know we have an II violation, but we need to find its root cause within our design and correct it.
We can find the root cause by setting the analysis focus to II violation. This focuses the analysis view on the design elements causing the II violation.
Once the focus is set to II violation in our example, we see that the root cause of the issue is the reads from the BRAM blocks.
The failing element is colored blue; this is the case in all analysis views that indicate an issue.
Knowing the failing element, we next need to identify which line of code is causing the issue.
We can do this by right-clicking the operation / control step and selecting “Goto Source.” The corresponding source line is then cross-probed.
Now that we know what the issue is and have identified the source line of code causing it, we can begin to implement optimization strategies. In this case, we can apply a cyclic partition to the block memory so that we obtain two block RAMs, each storing the data cyclically. For example, BRAM A contains data elements 0, 2, 4, 6, etc., while BRAM B contains 1, 3, 5, 7, etc. This allows the two pixel values to be read in parallel, achieving the desired initiation interval.
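As a sanity check on the banking scheme, here is a small software model of a factor-2 cyclic partition in plain C++. This is not the HLS source: the struct name `CyclicFrame` is illustrative, and the directive shown in the comment is the ARRAY_PARTITION pragma that would typically be applied where `frame` is declared:

```cpp
#include <cstdint>
#include <cstddef>
#include <vector>

// Software model of a cyclic partition with factor 2: even-indexed
// pixels land in bank A, odd-indexed pixels in bank B. In Vivado HLS
// this banking is requested with a directive such as:
//   #pragma HLS ARRAY_PARTITION variable=frame cyclic factor=2 dim=1
struct CyclicFrame {
    std::vector<uint16_t> bankA; // elements 0, 2, 4, 6, ...
    std::vector<uint16_t> bankB; // elements 1, 3, 5, 7, ...

    explicit CyclicFrame(const std::vector<uint16_t>& flat) {
        for (std::size_t i = 0; i < flat.size(); ++i) {
            (i % 2 == 0 ? bankA : bankB).push_back(flat[i]);
        }
    }

    // Reading an even-aligned pixel pair touches each bank exactly
    // once, so both reads can occur in the same clock cycle.
    uint32_t read_pair(std::size_t even_index) const {
        uint16_t lo = bankA[even_index / 2]; // pixel at even_index
        uint16_t hi = bankB[even_index / 2]; // pixel at even_index + 1
        return (static_cast<uint32_t>(hi) << 16) | lo;
    }
};
```

Note the model assumes the pair starts at an even index; if x_start offsets the access to an odd boundary, the pair straddles the banks the other way round, which the hardware partition handles but this simple sketch does not.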
Obviously, more complex algorithms may need a little more analysis and optimization, but at least we now know how we can focus in on what the root cause might be.
See My FPGA / SoC Projects: Adam Taylor on Hackster.io
Get the Code: ATaylorCEngFIET (Adam Taylor)