The Ultra96 provides both a processing system (PS) and programmable logic (PL). One of the key benefits of this heterogeneous approach is the ability to accelerate functions by moving them from software running on the processing system into the programmable logic.
Done correctly (we will explore this further below), moving functions into the programmable logic results in a significant acceleration in performance.
To get the best from this acceleration, we also want to shorten the design cycle and remove the need for separate PS and PL developments. These separate design processes have traditionally increased the development time when accelerating functions into the PL.
This increased time scale stems from the need to create a hardware description language (HDL) block, verify its performance and create SW to drive the new module.
Ideally what we want is a system optimizing compiler which allows us to move functions between the PS and PL seamlessly and with ease.
Introducing SDSoC
SDSoC is such a system optimizing compiler and, rather helpfully, the Ultra96 comes with a license for SDSoC.
Using SDSoC we can move functions from the PS to the PL with only an increase in the compile time, although to get the best performance we need to understand a little about logic design and behavior on hardware.
SDSoC enables movement between the PS and the PL thanks to a combination of Vivado HLS and a connectivity framework.
Vivado HLS is called to convert the function from C or C++ into an HDL module for implementation. This module is then mapped into a new Vivado design using the connectivity framework, e.g. by inserting a DMA.
When a function is accelerated into the PL the accelerated SW function is regenerated to reflect the transfer of data and the control of the PL accelerator.
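As a rough illustration of how the data transfer for an accelerated function can be described, here is a minimal sketch. The function name scale_accel and the constant LEN are hypothetical and do not come from the example project. It assumes the SDSoC `#pragma SDS data access_pattern` directive, which is placed immediately before the function declaration and tells the tool that an array is accessed sequentially. An ordinary compiler ignores the pragmas, so the code also runs on a desktop machine.

```cpp
#include <cassert>

constexpr int LEN = 1024;  // hypothetical buffer length, chosen for illustration

// SDS pragmas go directly before the declaration of the hardware function.
// SEQUENTIAL states that each array is read or written in index order,
// which lets the tool stream the data rather than treat it as random access.
#pragma SDS data access_pattern(in:SEQUENTIAL, out:SEQUENTIAL)
void scale_accel(const int in[LEN], int out[LEN], int gain);

// Hypothetical accelerator body: multiply every element by a gain.
void scale_accel(const int in[LEN], int out[LEN], int gain)
{
    for (int i = 0; i < LEN; i++) {
#pragma HLS PIPELINE
        out[i] = in[i] * gain;
    }
}
```

The calling code does not change: it still calls scale_accel as a normal function, and the tool replaces the call with the generated data transfer and control code.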
To use SDSoC we must have a platform definition for the board we are working with. This platform has two elements:
- Hardware definition - This defines the AXI interfaces available for the connectivity framework. With the MPSoC it is best to provide cache coherent ports if possible, e.g. the Accelerator Coherency Port (ACP), AXI Coherency Extension (ACE) or the High Performance Coherent AXI interfaces. This hardware definition also includes the available fabric clocks, interrupts and resets.
- Software definition - This defines the SW architecture for the supported bare metal, FreeRTOS and Linux operating systems. As such it includes elements such as the FSBL, BIF files, Linux images, BSPs etc.
I should point out that the SW definition does not need to include all of the supported operating systems; bare metal is the minimum requirement.
Introduction to High Level Synthesis
High Level Synthesis (HLS) enables developers to work in higher level languages than VHDL or Verilog. Typically these languages are C, C++ and OpenCL. HLS converts the higher level language into an HDL description that can be synthesized by Vivado. To achieve this Vivado HLS goes through the following stages when creating the HDL description.
To provide control over HLS optimizations, we can use pragmas within the source code. While there are many pragmas which can be used, some of the most common are:
- Data Flow - Allows for optimizations across functions
- Pipeline stages - Defines an initiation interval, the target number of clock cycles between the function accepting one input and being ready to accept the next
- Partition Memory to increase Read / Write Bandwidth - Breaks down arrays to provide more read and write ports
We will explore these concepts in detail when we examine the example source code.
Library Support
Of course, it further helps reduce the development time if we have a range of libraries available which can be accelerated into the PL. Rather helpfully, SDSoC provides us with support for acceleration of the following libraries.
- HLS Math Library
- HLS IP Library
- HLS Linear Algebra Library
- HLS Arbitrary Precision Data Types
- HLS Video Libraries
- HLS DSP Library
- HLS Stream Library
- HLS SQL Library
To get the best from SDSoC we need to transfer large quantities of data to and from the PL using DMA. If we are transferring small segments of data between the PS and PL the data transfer time will dominate and impact the results of the acceleration.
Amdahl's law gives a good indication of the acceleration achieved by moving a function from the PS to the PL:

$$S = \frac{1}{(1 - \alpha) + \frac{\alpha}{p}}$$

Where
- S: overall performance improvement
- alpha: fraction of the algorithm that can be sped up with hardware acceleration
- 1-alpha: fraction of the algorithm that cannot be improved
- p: the speed-up factor achieved on the accelerated portion
Set alpha to 0.1 and, even with a very large p, the overall speed-up stays close to 1 (it can never exceed about 1.11).
Set alpha to 0.5 with the same p and the overall improvement approaches a factor of two.
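To make those two cases concrete, here is a small sketch of Amdahl's law as a function. The name amdahl_speedup is hypothetical, alpha is expressed as a fraction between 0 and 1, and p is treated as a plain speed-up factor (100 means the accelerated part runs 100 times faster).

```cpp
#include <cassert>

// Overall speed-up S predicted by Amdahl's law.
// alpha: fraction (0..1) of the run time that can be accelerated
// p:     speed-up factor applied to that fraction
double amdahl_speedup(double alpha, double p)
{
    assert(alpha >= 0.0 && alpha <= 1.0 && p > 0.0);
    // The untouched part (1 - alpha) still takes its full time,
    // while the accelerated part shrinks to alpha / p.
    return 1.0 / ((1.0 - alpha) + alpha / p);
}
```

With p = 100, amdahl_speedup(0.1, 100.0) is roughly 1.11 and amdahl_speedup(0.5, 100.0) is roughly 1.98, which is why profiling to find the functions that dominate the run time matters before choosing what to accelerate.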
Getting up and Running
For this example we are going to use the Ultra96 platform provided by Avnet, which you can download here
This platform only contains support for the bare metal OS, but it is sufficient to get us going with SDSoC.
Once we have downloaded the SDSoC platform, the next step is to open SDSoC for the first time. When you do this, you will be asked for the location of the SDSoC workspace. It is within the workspace that your projects will be stored.
With the workspace initialized, you will see SDSoC's initial IDE page.
Before we can use SDSoC the next step is to unzip the SDSoC Platform we downloaded earlier.
You will notice the unzipped SDSoC platform contains two folders: Platforms and Prebuilt.
It is within the Platforms folder that we will find the Ultra96 SDSoC Platform.
The next step is to add in the custom platform. We can do this by selecting Xilinx->Add Custom Platform
This will open a dialog which allows us to manage the custom platforms
Click on the Add Custom Platform button and navigate to the top level of the SDSoC platform for the Ultra96
With the platform added, the next step is to create an example project. Open the project creation wizard by clicking on File->New SDx Project
On the second page we need to name the project. Remember not to use white space in the name; use underscores instead.
The third page of the creation wizard is where we select the platform. Select the Ultra96 platform, which will show in yellow as a custom platform.
The penultimate page of the project creation wizard is to define the processing system configuration.
The final page of the project creation wizard is to select the Xilinx Matrix Multiply example project.
At the completion of the wizard you will see the project created and the Application Project Settings tab.
It is this tab which controls the project and what functions if any are accelerated into the programmable logic.
The project is configured to accelerate the mmult_accel function into the PL; this can be seen under the hardware functions tab. Once this has been accelerated into the PL it will be clocked at 100 MHz.
With the project created we can now click on the build button (hammer icon) on the menu bar and start a build.
However, if the build fails after a few minutes with the following error, you need to edit the board definition file installed under Vivado (you did install these, right? If not, find out how here)
To correct this error you need to open the board definition file located at:
C:\Xilinx\Vivado\2018.2\data\boards\board_files\ultra96\1.2
The error occurs because the name in the board file is not the one expected by the SDSoC platform.
Open board.xml and change the name from Ultra96v1 to Ultra96
Correcting this file will enable the project to build and the boot.bin file to be created for testing on the Ultra96.
Code Deep Dive
Before we run the example application on the hardware, I think we should examine the optimizations made in the accelerated code.
void mmult_accel(float A[N*N], float B[N*N], float C[N*N])
{
    float _A[N][N], _B[N][N];
#pragma HLS array_partition variable=_A block factor=8 dim=2
#pragma HLS array_partition variable=_B block factor=8 dim=1

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            _A[i][j] = A[i * N + j];
            _B[i][j] = B[i * N + j];
        }
    }

    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
#pragma HLS PIPELINE
            float result = 0;
            for (int k = 0; k < N; k++) {
                float term = _A[i][k] * _B[k][j];
                result += term;
            }
            C[i * N + j] = result;
        }
    }
}
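When experimenting with changes to an accelerated function like this, it helps to compare its output against a plain software version running on the PS. The following is a minimal sketch of such a check, not code taken from the example project. The names mmult_reference and matrices_match are hypothetical, N is assumed to be 32 to match the 32 by 32 matrices discussed in this article, and the tolerance is an arbitrary choice.

```cpp
#include <cmath>

constexpr int N = 32;  // assumed matrix dimension (32 by 32)

// Plain software matrix multiply used as a golden reference.
void mmult_reference(const float A[N * N], const float B[N * N], float C[N * N])
{
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) {
            float result = 0.0f;
            for (int k = 0; k < N; k++) {
                result += A[i * N + k] * B[k * N + j];
            }
            C[i * N + j] = result;
        }
    }
}

// Returns true when every element of X and Y agrees within a relative
// tolerance. Floating point sums can differ slightly between software
// and hardware, so an exact comparison would be too strict.
bool matrices_match(const float X[N * N], const float Y[N * N], float tol = 1e-4f)
{
    for (int i = 0; i < N * N; i++) {
        float scale = std::fmax(1.0f, std::fabs(X[i]));
        if (std::fabs(X[i] - Y[i]) > tol * scale) {
            return false;
        }
    }
    return true;
}
```

In use, both the accelerated function and mmult_reference would be called with the same inputs and the two result buffers passed to matrices_match.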
This example uses two of the three most commonly used HLS pragmas: ARRAY_PARTITION and PIPELINE.
In HLS, arrays are often implemented as block RAMs, which can be a bottleneck because they have a maximum of two ports. ARRAY_PARTITION splits the array across several BRAMs so that accesses can be made in parallel, increasing performance.
In the code example
#pragma HLS array_partition variable=_A block factor=8 dim=2
This applies block partitioning to the variable _A (a 32 by 32 matrix), creating 8 BRAMs in place of one, as defined by the factor parameter. The dim parameter is used with multi-dimensional arrays to define which dimension is partitioned. In this case, as dim=2, the second dimension is partitioned.
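To visualize which memory an element ends up in, here is a small software model of the index-to-bank mapping. It is only a sketch of the concept, not something the tool generates. The function names are hypothetical, and it assumes the partitioned dimension divides evenly by the factor. Block partitioning keeps consecutive elements together, while the alternative cyclic type interleaves them.

```cpp
#include <cassert>

// Bank that holds a given index under block partitioning:
// the dimension is cut into 'factor' equal blocks of consecutive elements.
int block_bank(int index, int size, int factor)
{
    assert(factor > 0 && size % factor == 0 && index >= 0 && index < size);
    return index / (size / factor);
}

// Bank that holds a given index under cyclic partitioning:
// neighbouring elements are dealt out to successive banks in turn.
int cyclic_bank(int index, int size, int factor)
{
    assert(factor > 0 && size % factor == 0 && index >= 0 && index < size);
    return index % factor;
}
```

For a dimension of 32 with factor 8, block_bank places indices 0 to 3 in bank 0 and index 4 in bank 1, whereas cyclic_bank places indices 0 and 8 in bank 0 and index 1 in bank 1.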
To optimize the performance of the for loops we can use the PIPELINE pragma:
#pragma HLS PIPELINE
The PIPELINE pragma enables operations to occur concurrently, without the need for the entire processing chain to complete before the next operation starts.
Applying PIPELINE to a loop also unrolls any for loops nested inside it. Unrolling loops trades area for performance, so care needs to be taken. One rule of thumb is to start with the innermost loops.
While the latency of a single iteration is the same when the PIPELINE pragma is used, the throughput is significantly increased.
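The effect on throughput can be estimated with a simple cycle count model. This is a back-of-the-envelope sketch with hypothetical function names. It assumes every iteration has the same latency and that the pipeline starts a new iteration every ii clock cycles (the initiation interval).

```cpp
#include <cassert>

// Cycles for a loop whose iterations run strictly one after another.
long sequential_cycles(long trip_count, long iteration_latency)
{
    assert(trip_count > 0 && iteration_latency > 0);
    return trip_count * iteration_latency;
}

// Cycles for a pipelined loop: the first result appears after one full
// iteration latency, then a further iteration completes every 'ii' cycles.
long pipelined_cycles(long trip_count, long iteration_latency, long ii)
{
    assert(trip_count > 0 && iteration_latency > 0 && ii > 0);
    return iteration_latency + (trip_count - 1) * ii;
}
```

Taking illustrative numbers of 1024 iterations, an iteration latency of 5 cycles and an initiation interval of 1, the sequential loop needs 5120 cycles while the pipelined loop needs 1028.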
For larger designs with multiple functions, we can use the DATAFLOW pragma to optimize across functions.
The difference between PIPELINE and DATAFLOW is that DATAFLOW is a coarse grain approach which works on functions, while PIPELINE is a fine grain approach working on the operators within a function.
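A minimal sketch of the coarse grain approach is shown below. The two stage functions, the top level function and STAGE_LEN are hypothetical and are not part of the matrix multiply example. The DATAFLOW pragma is placed inside the top level function so that the second stage can start consuming data while the first is still producing it, and each stage is still pipelined internally.

```cpp
constexpr int STAGE_LEN = 64;  // hypothetical buffer length

// First stage: doubles every sample.
static void scale_stage(const int in[STAGE_LEN], int mid[STAGE_LEN])
{
    for (int i = 0; i < STAGE_LEN; i++) {
#pragma HLS PIPELINE
        mid[i] = in[i] * 2;
    }
}

// Second stage: adds a fixed offset to every sample.
static void offset_stage(const int mid[STAGE_LEN], int out[STAGE_LEN])
{
    for (int i = 0; i < STAGE_LEN; i++) {
#pragma HLS PIPELINE
        out[i] = mid[i] + 1;
    }
}

// Top level: DATAFLOW asks the tool to overlap the two stages,
// passing data between them through the intermediate buffer.
void two_stage_top(const int in[STAGE_LEN], int out[STAGE_LEN])
{
#pragma HLS DATAFLOW
    int mid[STAGE_LEN];
    scale_stage(in, mid);
    offset_stage(mid, out);
}
```

A standard compiler ignores the pragmas, so the same source can be checked in software before it is built for the PL.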
Running the Example
Now that we understand what is occurring and the optimizations that have been made, we can run the application on the hardware. To verify this is working as required we need to access a UART port, hence the need for the JTAG UART USB converter.
Within SDSoC we need to select the debugger and configure a new debug environment.
When we run this on the hardware we should see a significant increase in the performance from the accelerated function. This will be reported over the UART. The baud rate is 115200 and we can connect using the SDx Terminal.
Now that we understand the flow we will want to develop our own applications and identify which elements to accelerate and how to fine tune the performance of the accelerated function.
Doing this requires a little more understanding of SDSoC's capabilities, including:
Profiling - This identifies where time is spent in the execution of the program in the PS. The profiler shows both the inclusive and exclusive execution time of each function. We can use the profiler to identify functions which should be considered for acceleration.
Tracing - This inserts dedicated hardware into the PL to monitor behavior, enabling us to examine how much time is spent in each aspect of the execution. This provides a detailed understanding of the system during execution.
Profiling
To open the TCF profiler click on Window > Show View > Other > Debug > TCF Profiler.
Once we have opened the profiler we need to run the debug on the hardware. Before we run the debug application, start the TCF profiler by clicking on the run button (circled in red below). Ensure you check the Enable Stack Tracing option.
Once your application completes you will see the TCF profiler populates with information on the execution time of each function.
At this point I should explain what the terms inclusive and exclusive mean:
Exclusive: The amount of execution time spent in the function alone.
Inclusive: The amount of execution time spent in the function and all of its sub-function calls.
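The relationship between the two figures can be sketched in a few lines. The structure and function below are hypothetical illustrations of the definitions, not the data format used by the TCF profiler: the inclusive time of a function is its own exclusive time plus the inclusive time of everything it calls.

```cpp
// One function in a call tree, with the time spent in its own code.
struct ProfileNode {
    double exclusive_ms;         // time spent in this function alone
    const ProfileNode* callees;  // array of functions it calls (may be null)
    int callee_count;            // number of entries in 'callees'
};

// Inclusive time: own time plus the inclusive time of every callee.
double inclusive_ms(const ProfileNode& node)
{
    double total = node.exclusive_ms;
    for (int i = 0; i < node.callee_count; i++) {
        total += inclusive_ms(node.callees[i]);
    }
    return total;
}
```

A function with a small exclusive time but a large inclusive time is mostly waiting on its callees, so it is the callees that deserve attention when choosing what to accelerate.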
Tracing
Once we have accelerated a function into the hardware we want to be able to ensure the system is functioning optimally, and this is where tracing comes in. Tracing inserts monitors in the PL and enables us to see the PS, PL and data transfers.
We enable tracing by checking the enable event tracing option when we generate the SDSoC project.
Once the build completes and we debug on the hardware, we also need to set the option in the debug configuration to say we wish to trace the design.
Once the tracing has run to completion you will see another project under the project explorer with the project name and traces appended.
Double click on the AXI trace set up and you will see the graph below, which shows the tracing results.
In the above example Orange is SW, Green is Accelerator in the PL and Blue is Data Transfer.
Conclusion
Now that we understand how to install, configure, test, profile, and trace applications in SDSoC, you can get started on projects with your Ultra96.
My one day SDSoC Course from FPGA Forum in June 2018
You can find the files associated with this project here:
https://github.com/ATaylorCEngFIET/Hackster
See previous projects here.
More on Xilinx FPGA development weekly at MicroZed Chronicles.