Adaptive computing provides developers a significant boost in performance, system integration, and power efficiency.
The AMD-Xilinx Versal devices, the first Adaptive Compute Acceleration Platforms (ACAPs) released, offer developers a considerable performance advantage.
If you are not familiar with Versal devices, they combine Processing Systems with Programmable Logic, AI Engines, and a wide range of high-bandwidth interfacing solutions. The exact contents of the device depend upon which family is selected.
Architecturally, the Versal device comprises the following elements:
- Processing system comprised of a Full Power Domain (FPD) which contains the dual-core Arm Cortex-A72 processors in the Application Processing Unit (APU). The Low Power Domain (LPD) contains the Real-time Processing Unit (RPU) with dual-core Arm Cortex-R5F processors.
- Platform Management Controller (PMC) – This contains two triple modular redundant (TMR) MicroBlaze instantiations that control the boot, the configuration of the PL, and the overall platform.
- Cache Coherent Interconnect (CCIX) for Accelerators and PCIe (CPM) – As the primary PCIe interface for the PS, it also allows accelerators in the PL to connect to the CCIX and act as a CCIX accelerator.
- Network on Chip (NoC) – This provides device-wide network connection to the PL / AI Engines (on supported devices) and the DDR controllers.
- Programmable Logic – Programmable logic resources which also provide interfaces to the device interfacing options (e.g., 400G Crypto, 600G Ethernet etc.).
Similar to the Zynq SoC and Zynq MPSoC devices, there are dedicated MIO for the PS and PMC.
Architecture
Platform Management Controller (PMC) – This manages the overall platform. It provides configuration, bring-up, and platform management during operation, and it enables and facilitates system-wide debug and tracing. To do this, the PMC has two processors. The ROM Control Unit (RCU) executes the initial stages of the boot and runs the boot ROM, while the PMC Processing Unit (PPU) loads the Programmable Device Image (PDI). The PDI replaces the BIN file used to configure earlier SoCs. It contains several different files, including the Platform Loader and Manager (PLM), which loads and processes the other files to configure the device. These include the CDO files, which configure the PMC, PS, clocking, MIO and reset, and the NPI file, which configures the NoC, AI Engines, PL, and DDR controller. Finally, the ELF files are loaded for the processors to begin the application.
Application Processing Unit (APU) – This contains dual-core 64-bit Arm Cortex-A72 processors which support superscalar, out-of-order execution and implement the Armv8-A architecture. Each processor has 48 KB of L1 instruction cache and 32 KB of L1 data cache; the instruction cache is protected by parity and the data cache by ECC. The processors also provide floating-point units and NEON units for Single Instruction Multiple Data (SIMD) processing. The L2 cache is 1 MB, is connected to the cache coherent interconnect, and is controlled by the Snoop Control Unit.
Real-time Processing Unit (RPU) – Intended to enable the implementation of functional safety applications, the RPU provides dual 32-bit Cortex-R5F processors which are based on the Armv7-R architecture and include a floating-point unit. Each processor has 32 KB of L1 cache and tightly coupled memories (TCM) with single-cycle read access. The TCM can be configured in two structures. In performance mode, the processors are independent and each processor has 128 KB of TCM. In lockstep mode, the two cores run together as a single processor for safety applications and 256 KB of TCM is provided.
I/O Peripheral (IOP) – The I/O peripherals are routed through the PS low-power domain or PMC Multiplexed IO (MIO). These MIO signals provide access from the PS to a range of standard peripherals including SPI, I2C, GPIO, CAN FD, UART, GigE, USB 2.0, QSPI, OSPI and eMMC, which let us connect to industry-standard interfaces. The I/O peripherals, however, do not define the limits of the high-performance interfacing in Versal.
Network on Chip (NoC) – The NoC spans the entire device and is an AXI4-based network capable of routing high-bandwidth, real-time, and low-latency connections. The major NoC connections are the DDR controllers, PS to PL, PL to PL, and AI Engine access.
Programmable Logic – This contains the high-performance parallel structures to implement custom high-speed designs. It mainly consists of DSP engines, configurable logic blocks, block memory, and UltraRAM. Connections between the PL and the PS domain are via either AXI interfaces (e.g., AXI4-Lite, AXI, ACE, ACP) or the Network on Chip. Within the PL itself, there are several NoC channels which can be connected to. The PL also has interfaces to the AI Engines and provides the integrated peripherals.
Integrated Peripherals – The integrated peripherals are the device options which are included in the PL fabric. These peripherals include the 100 G Ethernet MAC, 600 G Channelized Ethernet, 600 G Interlaken, 400 G Crypto Engine, Video Decoder Unit, and GTM transceivers.
Integrated Hardware Options – In addition to the integrated peripherals, Versal devices have integrated hardware options across the range of devices. These include the AI engines, accelerator RAM, coherency for PCIe module with CCIX, coherency for PCIe module with CXL, and high-bandwidth memory interface. Both the coherency for PCIe module with CCIX and coherency for PCIe module with CXL are referred to as the CPM in the block diagram (coherent module with PCIe).
Memory Map
The Versal global address map provides 16 TB of address space and is based upon the Network on Chip (NoC). This global address map provides access and communication between all of the Versal architectural elements.
Such a large address space is segmented into several different regions which provide access to the processing system, CPM, DDR, PL access, and the AI engines.
The lower 4 GB of the address space contains low DDR memory, the PL interfaces, and the processing system and platform management controller registers. The first 2 GB of this address space is reserved for DDR memory access. The remaining 2 GB is split: most of it provides system address space for the AXI PL LPD and FPD interfaces, Octal SPI flash memory, and PCIe address region 0, while the final 256 MB contains the configuration registers for the PS, PMC, and associated systems.
The entire address space can be visualized in the diagram below and the Versal technical manual (AM011) provides all the detailed information for the exact address locations.
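As a rough illustration of the lower 4 GB split described above, here is a minimal sketch that classifies a global address into one of those regions. The boundary values are inferred from the sizes quoted in the text (2 GB of DDR, registers in the last 256 MB below 4 GB, 16 TB in total), and the enum and function names are hypothetical; AM011 remains the authority for the exact address ranges.

```c
#include <stdint.h>

/* Boundaries inferred from the sizes quoted in the text (assumptions,
 * not copied from AM011). */
#define ADDR_SPACE_SIZE  (1ULL << 44)      /* 16 TB global address map      */
#define LOW_DDR_END      0x080000000ULL    /* first 2 GB: DDR               */
#define LOW_REGS_START   0x0F0000000ULL    /* last 256 MB below 4 GB        */
#define LOW_4GB_END      0x100000000ULL    /* end of the lower 4 GB         */

/* Hypothetical region names used only for this illustration. */
typedef enum {
    REGION_DDR_LOW,        /* DDR memory access                           */
    REGION_PL_OSPI_PCIE,   /* AXI PL LPD/FPD, Octal SPI, PCIe region 0    */
    REGION_PS_PMC_REGS,    /* PS, PMC and associated registers            */
    REGION_ABOVE_4GB,      /* everything else in the 16 TB map            */
    REGION_INVALID         /* outside the 16 TB address space             */
} versal_region_t;

/* Classify a global address according to the lower 4 GB layout. */
static versal_region_t classify_address(uint64_t addr)
{
    if (addr >= ADDR_SPACE_SIZE) {
        return REGION_INVALID;
    }
    if (addr < LOW_DDR_END) {
        return REGION_DDR_LOW;
    }
    if (addr < LOW_REGS_START) {
        return REGION_PL_OSPI_PCIE;
    }
    if (addr < LOW_4GB_END) {
        return REGION_PS_PMC_REGS;
    }
    return REGION_ABOVE_4GB;
}
```

A call such as classify_address(0x80000000) falls in the PL / OSPI / PCIe window under these assumed boundaries, because it is the first address past the 2 GB of DDR.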
The address space does, however, contain several points which we need to understand. The first of these is the PMC Alias. Several Versal devices use stacked silicon interconnect and, as a result, contain several PMCs. Each PMC has a NoC alias address which allows the optional PMCs to be addressed as necessary.
In our Versal applications, we may need a large data buffer off chip which is where the DDR memory system comes into play.
All Versal devices include at least one DDR Memory Controller (DDRMC) which is configured as part of the Network on Chip IP. Interfacing with the DDRMC are four NoC channels as shown in the address space. Each of these four NoC channels can have a different quality of service (QoS) applied, while the controller itself can be reconfigured as two separate units, both providing 32-bit DDR4 interfaces. Splitting the memory controller can aid performance.
The DDRMC provides the ability to perform read reordering. If you are not familiar with the practice, reordering is where the DDRMC changes the order of transactions to improve memory access efficiency. The DDRMC in Versal has four states for reordering:
Read Priority: Read transactions are given priority – default.
Write Priority: The number of write commands has exceeded a threshold, so writes take priority until the pressure is lowered.
Write/Read: Efficient transactions are the priority in this state.
Starved: One or more read transactions are starved.
The DDRMC also supports single-error-correcting, double-error-detecting (SECDED) ECC. The check bits are stored in an additional byte of memory width, so a 64-bit interface becomes 72 bits and a 32-bit interface becomes 40 bits. This allows single-bit errors to be corrected on the fly and double-bit errors to be reported.
When we are working with critical or high-reliability applications, we do not want single errors to evolve into double bit errors. To prevent this, background scrubbing is possible using the idle cycles of the controller. The scrubbing period for DDR4 can be set in the GUI versus being a fixed period for other supported memories.
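The 72-bit and 40-bit widths quoted above can be checked with a little arithmetic. The sketch below assumes a textbook Hamming-based SECDED code (the source does not say which code the DDRMC actually uses) and rounds the check bits up to whole byte lanes; all function names are hypothetical.

```c
#include <stdint.h>

/* Check bits needed for single-error correction of data_bits using a
 * Hamming code: the smallest r with 2^r >= data_bits + r + 1. */
static unsigned hamming_check_bits(unsigned data_bits)
{
    unsigned r = 0;
    while ((1ULL << r) < (unsigned long long)data_bits + r + 1ULL) {
        r++;
    }
    return r;
}

/* SECDED adds one overall parity bit so double-bit errors are detected. */
static unsigned secded_check_bits(unsigned data_bits)
{
    return hamming_check_bits(data_bits) + 1U;
}

/* Interface width once the check bits are rounded up to whole byte lanes. */
static unsigned ecc_interface_width(unsigned data_bits)
{
    unsigned check_bits = secded_check_bits(data_bits);
    unsigned ecc_lanes = (check_bits + 7U) / 8U;   /* round up to bytes */
    return data_bits + ecc_lanes * 8U;
}
```

For 64 data bits this gives 7 Hamming bits plus one parity bit, which fills the extra byte exactly. For 32 data bits only 7 check bits are needed under this assumption, but a whole byte lane is still added, which matches the 40-bit figure.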
Application
In this project we are going to create a simple bare-metal Versal application which uses the PS, NoC, DDRMC and an AXI BRAM controller connected to the NoC, to demonstrate a good way to get started.
I do realise that Versal boards are expensive. However, I have written this project to share knowledge about Versal with those who are interested.
I will be following this up with a project which looks at creating and working with PetaLinux.
The target board for this application is the VMK180.
Vivado Design
To get started in Vivado we first need to create a project.
Select the VMK180 board and when the project is opened create a new block diagram.
Leave the defaults unchanged
Once the block diagram is open, select the + to add an IP block and enter CIPS. This will add the Control, Interfaces and Processing System (CIPS).
Run the block automation to configure the CIPS for the VMK180
Select a memory controller and new NoC
This will create an updated block diagram with the CIPS configured correctly and the DDRMC connected over the NoC.
Double click on the AXI NoC and on the output channel enable one AXI Master interface. It is to this interface that we will connect the AXI BRAM Controller and BRAM.
Add in a new AXI BRAM Controller
Reconfigure the block to support only one BRAM
Open the CIPS module for reconfiguration - Click on PS/PMC
On the clocks tab select output clocks and enable the PL Clock 0
On the PS/PL interfaces select number of resets to be 1
Run the connection automation
Connect a processor reset block and connect the resets to the AXI BRAM controller and the AXI NoC.
Run the validation for the design
Examine the addressing
Create the wrapper for the design
Compile the device image
Export the design including the Device Image to Vitis.
From the Vivado tools menu open Vitis IDE and select the workspace
Once Vitis opens select create application project
Select the previously exported XSA to work on
Enter a project name - targeting the first A72
Leave the domain unchanged
Select the hello world application - we will use this as a base and initially use it to pipe clean the process
Build the project
Run the application on the board.
Running this on the VMK180 should show the following
Vitis will show the single steps
If you are unsure how to work with the BRAM over AXI / NoC, the BSP tab will show links to documentation and examples for all the hardware in the design.
We can use the code below to read and write to the AXI BRAM over the NoC.
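    #include <stdio.h>
    #include "platform.h"
    #include "xil_printf.h"
    #include "xil_io.h"
    #include "xstatus.h"
    #include "xparameters.h"
    #include "xbram.h"

    #define BRAM_DEVICE_ID XPAR_BRAM_0_DEVICE_ID
    #define NUM_WORDS      128  /* number of 64-bit words to test */

    XBram Bram;

    int main()
    {
        XBram_Config *ConfigPtr;
        u64 baseaddr = XPAR_BRAM_0_BASEADDR;
        u64 input;
        u64 i;

        init_platform();
        // disable_caches();

        print("Hello World\n\r");
        print("Successfully ran Hello World application\n\r");

        /* Look up and initialise the AXI BRAM controller driver */
        ConfigPtr = XBram_LookupConfig(BRAM_DEVICE_ID);
        if (ConfigPtr == NULL) {
            print("BRAM configuration lookup failed\n\r");
            cleanup_platform();
            return -1;
        }
        if (XBram_CfgInitialize(&Bram, ConfigPtr,
                                ConfigPtr->CtrlBaseAddress) != XST_SUCCESS) {
            print("BRAM initialisation failed\n\r");
            cleanup_platform();
            return -1;
        }

        /* Write an incrementing pattern, one 64-bit word per location */
        for (i = 0; i < NUM_WORDS; i++) {
            Xil_Out64(baseaddr, i);
            baseaddr = baseaddr + 8;
        }

        /* Read back with the same 64-bit width and compare */
        baseaddr = XPAR_BRAM_0_BASEADDR;
        for (i = 0; i < NUM_WORDS; i++) {
            input = Xil_In64(baseaddr);
            baseaddr = baseaddr + 8;
            if (input != i) {
                printf("error\n\r");
            }
        }

        cleanup_platform();
        return 0;
    }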
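If you want to go a little further than the incrementing pattern, the same idea can be written as small reusable routines. This is a minimal sketch and not part of the Xilinx BRAM driver: the function names are hypothetical, and it works through a plain volatile pointer so it can be tried on an ordinary array before being pointed at the BRAM.

```c
#include <stddef.h>
#include <stdint.h>

/* Write each word's own index, then read everything back.
 * Returns the number of mismatching words. */
static size_t mem_test_index_pattern(volatile uint64_t *base, size_t words)
{
    size_t errors = 0;

    for (size_t i = 0; i < words; i++) {
        base[i] = (uint64_t)i;
    }
    for (size_t i = 0; i < words; i++) {
        if (base[i] != (uint64_t)i) {
            errors++;
        }
    }
    return errors;
}

/* Walk a single 1 through all 64 bit positions of one location.
 * Returns the number of bit positions that read back incorrectly. */
static size_t mem_test_walking_ones(volatile uint64_t *word)
{
    size_t errors = 0;

    for (unsigned bit = 0; bit < 64; bit++) {
        uint64_t pattern = 1ULL << bit;
        *word = pattern;
        if (*word != pattern) {
            errors++;
        }
    }
    return errors;
}
```

On the target, the pointer passed in would be the BRAM base address cast to a volatile 64-bit pointer, for example (volatile uint64_t *)XPAR_BRAM_0_BASEADDR, with a word count of 128 to match the test above. A return value of zero from both routines means every word and every data bit read back as written.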
Conclusion
Versal ACAPs are very impressive devices, and the NoC also enables significant timing improvement in the PL. Of course, the AI Engines are also game changing.
We will come back to the Versal in a project soon and look at how we can create PetaLinux solutions.