One of the most common IP cores I use in my designs, and see clients use in their designs, is the MicroBlaze processor. It enables us to perform complex sequential operations which are not suitable for implementation in finite state machines, as the state machines would be too large or complex.
In this project we are going to examine how we can implement MicroBlaze solutions which give us the best performance for the application.
There are a few simple rules we can use, and this project is going to demonstrate why they work.
MicroBlaze Architecture
The MicroBlaze is a highly configurable Harvard architecture processor. By default there are three pre-sets:
- Microcontroller
- Real Time Controller
- Application Processor
Along with these three pre-set configurations there are also other optimisation / configuration options which can be used.
The MicroBlaze processor can be implemented with a three, five or eight stage processing pipeline.
One important element to understand when getting the best from MicroBlaze is the memory hierarchy:
- Local Memory - Tightly coupled memory with a very low access latency through a dedicated interface
- AXI Block RAM - If local memory is insufficient, or shared memory is required, BRAM can be connected through an AXI interface
- External Memory - External Memory is usually either RAM (DDR or SDRAM) or some form of non-volatile memory such as QSPI NOR.
- Shared Memory - With the AXI Block RAM or External DDR a common interconnect must be used with other DMA masters.
When it comes to implementation we can make a number of configuration changes which will improve performance.
The first of these is to configure the optional instructions:
- Enable Barrel Shifter - This enables registers to be manipulated by shifting and rotating.
- Enable Floating Point Unit - Implements an IEEE-754 floating point unit. Implementing an FPU significantly increases performance if floating point numbers are used in the application.
- Enable Integer Multiplier - Enabled if integer multiplication is used within the application.
- Enable Integer Divider - Enabled if integer division is used within the application.
- Enable Additional Machine Status Register Instructions - Provides additional machine status register instructions for setting and clearing bits.
- Enable Pattern Comparator - Accelerates the pattern comparison within the
- Enable Reversed Load/Store and Swap Instructions - Increases performance when converting between big and little endian elements e.g. network and MicroBlaze.
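To make a couple of these options more concrete, the sketch below shows the kind of C code that benefits from them. The function names are hypothetical and chosen only for illustration. `swap32` is a portable byte swap of the sort used for endian conversion, and `shift_left_bitwise` mimics roughly what has to happen for a multi-bit shift when no barrel shifter is present. Whether the compiler actually emits the dedicated hardware instructions depends on the MicroBlaze configuration and the compiler flags in use, so treat this as an illustration rather than a statement of what the toolchain generates.

```c
#include <assert.h>
#include <stdint.h>

/* Reverse the byte order of a 32-bit word (big endian <-> little endian). */
uint32_t swap32(uint32_t x)
{
    return ((x & 0x000000FFu) << 24) |
           ((x & 0x0000FF00u) << 8)  |
           ((x & 0x00FF0000u) >> 8)  |
           ((x & 0xFF000000u) >> 24);
}

/* Shift left one bit at a time, roughly the work needed for a multi-bit
 * shift when no barrel shifter is available (illustrative helper only). */
uint32_t shift_left_bitwise(uint32_t x, unsigned n)
{
    assert(n < 32u);
    while (n-- > 0u) {
        x += x;   /* a one-bit left shift expressed as an add */
    }
    return x;
}
```

With the barrel shifter enabled, a shift by n places can be a single instruction instead of a loop like the one above, which is why the option matters for code that does a lot of bit manipulation.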
Another significant element which can increase the performance of the application is the inclusion of Data and Instruction Caches. This is especially important when we are running larger applications from external memory.
As the MicroBlaze is a Harvard architecture, we need both Data and Instruction Caches.
The caches are implemented in Block RAM within the FPGA's fabric.
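A simple way to reason about why caches matter is the average cost of a memory access. The sketch below uses the standard single-level cache model; the function name and any latency figures passed in are assumptions for illustration, not measured MicroBlaze or DDR4 numbers.

```c
#include <assert.h>

/* Average cycles per memory access for a single-level cache:
 *   hit_rate * hit_cycles + (1 - hit_rate) * miss_cycles
 * All inputs are illustrative; none are measured MicroBlaze figures. */
double average_access_cycles(double hit_rate, double hit_cycles, double miss_cycles)
{
    assert(hit_rate >= 0.0 && hit_rate <= 1.0);
    return hit_rate * hit_cycles + (1.0 - hit_rate) * miss_cycles;
}
```

For example, assuming a 1 cycle hit and a 30 cycle miss, a 95% hit rate gives an average of about 2.45 cycles per access, while running with no cache at all (a hit rate of 0) costs the full 30 cycles every time.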
Now that we understand how the different optimisations can affect our MicroBlaze performance, let's look at the direct impact of these options.
Vivado Design
For this project I am going to target an Artix UltraScale+ FPGA on the Opal Kelly XEM8320 board. This provides us with high performance programmable logic and DDR4, making it ideal for this project.
To get started we need to create a new Vivado project which targets the XEM8320.
To target the XEM8320 board you may need to download the board definition first.
Once the project is open, create a new block diagram and add into it the DDR4 MIG from the board tab.
To the rest of this system add in a MicroBlaze.
Run the block automation, select a local memory size of 32KB and, for this initial development, leave the Cache disabled. Set the Debug module to provide both Debug and UART features.
Once this has been run we can see the initial design.
Into this design we will also add an AXI Timer and connect the interrupt from the timer to the MicroBlaze interrupt system.
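The timer gives us a way to measure how long a piece of code takes. As a minimal sketch, assuming the timer is configured as a free-running 32-bit up counter, the helpers below turn two counter readings into an elapsed time. The function names and the clock frequency passed in are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Ticks elapsed between two reads of a free-running 32-bit up counter.
 * Unsigned subtraction gives the right answer across a single wraparound. */
uint32_t elapsed_ticks(uint32_t start, uint32_t end)
{
    return end - start;
}

/* Convert ticks to whole microseconds for a counter clocked at clock_hz. */
uint64_t ticks_to_us(uint32_t ticks, uint32_t clock_hz)
{
    assert(clock_hz != 0u);
    return ((uint64_t)ticks * 1000000u) / clock_hz;
}
```

On the target, the start and end values would come from reading the timer counter through the driver before and after the code under test, for example with XTmrCtr_GetValue from the AXI Timer driver.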
With this simple design we can build the design in Vivado and export the hardware to Vitis.
Vitis - Base Design
In Vitis create a new application which is based on the XSA just exported from Vivado.
Navigate through the project creation process and select the Dhrystone application. This provides us with a demanding application to test with, and reports a figure we can use for comparison.
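When comparing results between configurations it can help to normalise the Dhrystone figure. The conventional conversion divides Dhrystones per second by 1757, the score of the VAX 11/780 reference machine, to give DMIPS, and dividing again by the clock frequency in MHz gives DMIPS/MHz. The sketch below shows the arithmetic; the function names are hypothetical and the inputs are whatever your own run reports.

```c
#include <assert.h>

#define DHRYSTONES_PER_DMIPS 1757.0  /* VAX 11/780 reference score */

/* Convert a Dhrystones-per-second result into DMIPS. */
double dmips(double dhrystones_per_second)
{
    assert(dhrystones_per_second >= 0.0);
    return dhrystones_per_second / DHRYSTONES_PER_DMIPS;
}

/* Normalise by clock frequency so different clock rates can be compared. */
double dmips_per_mhz(double dhrystones_per_second, double clock_hz)
{
    assert(clock_hz > 0.0);
    return dmips(dhrystones_per_second) / (clock_hz / 1.0e6);
}
```

Using DMIPS/MHz makes it easier to see the effect of a configuration change on its own, separate from any change in the clock frequency the design closes timing at.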
By default the application is configured to run from the DDR4 memory space, and we have not yet introduced any Caches or additional instructions. Running the application we will see the basic performance is pretty low. This is because there is no Cache and the application is running from DDR4, which has a significant latency for each access.
Next we go back to Vivado, enable the extra instructions and re-implement the design.
When the application is re-run in Vitis we will see the increase in performance which comes from having the additional instructions included.
Of course we are still suffering from the lack of Cache, so the time taken to access DDR memory still has the biggest impact on overall performance.
Updating our Vivado design to include Caches for instructions and data means adding in AXI BRAM connected to the cache interfaces on the MicroBlaze.
Re-running the application with the Cache enabled gives a good increase in the overall system performance.
Of course, the performance of the application will vary depending upon the
Vitis - Running From BRAM
If we change the linker script to run from only BRAM, we see similar performance, as BRAM on the local memory bus is tightly coupled to the MicroBlaze.
Having configured the MicroBlaze to provide the best performance possible, we still often need to optimise the applications running on the processor. This might mean changing the algorithm to a more efficient implementation or looking at using an FPGA IP block.
We can implement profiling very simply in our MicroBlaze projects by using TCF profiling.
To provide an example of how to use TCF profiling we can use the project below, which implements a PID controller.
#include <stdio.h>
#include "platform.h"
#include "xil_printf.h"
#include "xparameters.h"
#include "xintc.h"
#include "xil_exception.h"
#include "mx517.h"

#define iterations 40

/* Controller state carried between calls to PID() */
static data_type error_prev = 0;
static data_type i_prev = 0;

XIntc InterruptController;

data_type PID(data_type set_point, data_type KP, data_type KI, data_type KD,
              data_type sample, data_type ts, data_type pmax)
{
    data_type error;
    data_type p;
    data_type i;
    data_type d;
    data_type op;

    error = set_point - sample;
    p = error * KP;
    i = i_prev + (error * ts * KI);
    /* Derivative term is calculated but currently left out of the output */
    d = KD * ((error - error_prev) / ts);
    op = p + i; //+d;
    error_prev = error;

    if (op > pmax) {
        /* Output saturated: clamp it and hold the integrator (anti-windup) */
        op = pmax;
    } else {
        i_prev = i;
    }
    return op;
}

int main()
{
    init_platform();
    print("Hello World\n\r");
    print("Successfully ran Hello World application");
    //IntcInterruptSetup(&InterruptController, XPAR_INTC_0_DEVICE_ID);

    data_type set_point = -80.0;
    data_type sample[iterations] = {-90.000,-88.988,-87.977,-86.966,-85.955,-84.946,-83.936,-82.928,-81.920,-80.912,-80.283,-79.926,
        -79.784,-79.774,-79.829,-79.898,-79.955,-79.993,-80.011,-80.017,-80.016,-80.010,-80.005,-80.002,-80.000,-79.999,
        -79.999,-79.999,-79.999,-80.000,-80.000,-80.000,-80.000,-80.000,-80.000,-80.000,-79.999,-80.000,-80.001,-80.000};
    data_type kp = 19.6827; // w/k
    data_type ki = 0.7420;  // w/k/s
    data_type kd = 0.0;
    data_type op;

    printf("testing cpp\r\n");
    /* Run the controller over the recorded samples and print each output */
    for (int i = 0; i < iterations; i++) {
        op = PID(set_point, kp, ki, kd, sample[i], 12.5, 40);
        printf("result %f\r\n", op);
    }

    cleanup_platform();
    return 0;
}
When we debug the program and it halts, open the profiler view before running.
You can find this under Window -> Show View; enter TCF in the search box and select TCF Profiler.
Once the TCF Profiler view is open, select options and enable stack tracing.
Running the application will then enable us to see the profiled results of the application.
We can explore these profiled results to identify bottlenecks in our application.
Wrap Up
The project has shown that to get the best from our MicroBlaze applications we need to think carefully about the end application. For example, if we are using floating point algorithms we need to make sure the FPU is enabled.
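One related detail worth checking is the precision of the arithmetic itself. As far as I am aware, the FPU on the 32-bit MicroBlaze operates on single precision values, so any calculation that is promoted to double falls back to software routines even with the FPU enabled. In C, an unsuffixed constant such as 12.5 is a double, which is enough to promote the whole expression. The sketch below is a hypothetical single precision variant of the proportional and integral step from the PID example; the function name is an assumption, and the gains are copied from that example.

```c
#include <assert.h>

/* One proportional + integral step kept entirely in single precision.
 * The f suffix stops the constants promoting the arithmetic to double. */
float pi_step(float error, float integral_prev, float pmax)
{
    const float kp = 19.6827f;   /* gains copied from the PID example */
    const float ki = 0.7420f;
    const float ts = 12.5f;      /* sample period used in the PID example */
    float integral;
    float op;

    assert(pmax > 0.0f);
    integral = integral_prev + (error * ts * ki);
    op = (error * kp) + integral;
    if (op > pmax) {
        op = pmax;               /* clamp to the maximum output */
    }
    return op;
}
```

If the profiler shows time being spent in software floating point routines despite the FPU being present, unsuffixed constants and double typed variables are a reasonable first place to look.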
Instructions such as those provided by the barrel shifter can result in significant improvements, as shifting and rotating registers is a commonly used feature in programming.
The critical aspect to recovering performance if DDR is used is to ensure a cache is implemented within the Block RAM and that it is correctly configured to provide maximum performance.
Once we have our MicroBlaze all up and running on what should be an optimum configuration, we can use TCF profiling to determine how our application is running on the target and potentially use that to feed back into configuration changes of the MicroBlaze if necessary!