When AMD-Xilinx upgraded their design tool environment from Vivado + XSDK to the Vivado + Vitis Unified Software Platform, one of the main capabilities the new toolchain introduced was support for developing FPGA acceleration.
A common pitfall in learning about FPGA acceleration is that a lot of the material out there uses the terms FPGA acceleration and offloading interchangeably, despite the fact that they are very different functionalities.
Offloading is the simpler of the two: it is just the implementation of a function in the programmable logic (PL) of an FPGA by a straight conversion of high-level C/C++ code to RTL using a tool such as Vitis HLS.
True acceleration on an FPGA, however, is the modification of a data processing algorithm to take advantage of the processing capabilities of the FPGA's programmable logic, followed by the compilation of the algorithm's code for the specific embedded (FPGA) target with hooks for the rest of the software running on the CPU to send data to, and receive data from, the algorithm running in the PL.
Before getting into the specifics of how to restructure an algorithm for acceleration, it's important to understand the mechanics of how to structure a design in Vivado/Vitis for FPGA acceleration. This can be broken down into three main steps:
- Hardware Hooks: Create Extensible Vitis Platform in Vivado
- Software Hooks: Add XRT acceleration kernel to embedded Linux in PetaLinux
- Create Vitis platform and test/validate its functionality
This project post will focus on the first step: configuring the hardware hooks for an accelerated design in Vivado. The details of how to configure the hardware design for acceleration differ based on the architecture of the target AMD-Xilinx FPGA (e.g., Zynq-7000 vs. Zynq UltraScale+ MPSoC). To complicate things further, the Kria KV260 is a bit of a unique case in the Vitis acceleration workflow (which I'll explain at the relevant points in the project), so I'll be focusing on the KV260 specifically for this first series of projects this month, then creating more general tutorial series in the coming months: one for Zynq UltraScale+ MPSoC based boards and another for Zynq-7000 based boards.
This project series is done in Vivado/Vitis version 2021.2, and the steps should translate directly to 2021.1 as far as I'm aware. However, they do not translate directly to versions 2019.x, 2020.x, or 2022.x. This workflow is still on the newer side from AMD-Xilinx, so I've noticed it varies between major versions more than usual; my educated guess is that it will stabilize moving forward. I chose to do this project series in 2021.2 instead of 2022.1 because, at the time of writing, not all of the software components for the Kria KV260 were available in the 2022.1 version.
This workflow works on both Ubuntu 18.04 and Ubuntu 20.04 host PC operating systems.
Create Vivado Project
Launch Vivado and create a new project targeting the Kria KV260 board. The only difference from the normal project flow is that on the Project Type page, you must check the option Project is an extensible Vitis platform.
Then select the Kria KV260 as the target Xilinx part for the project:
Then select the carrier board connections by clicking the Connections hyperlink under the Kria KV260 Vision AI Starter Kit title and selecting Vision AI Starter Kit carrier card for Connector 1 in the kv260 pop-up window.
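For reference, the same wizard steps can be scripted from the Vivado Tcl console. The sketch below is my own approximation, not an exported script: the project name is arbitrary, and the board part/connection version strings (1.2 here) are assumptions that vary by Vivado release, so verify them first.

```tcl
# Hypothetical Tcl equivalent of the New Project wizard steps above.
# Check the exact board part/connection versions for your install with:
#   get_board_parts *kv260*
create_project kv260_accel ./kv260_accel -part xck26-sfvc784-2LV-c
set_property board_part xilinx.com:kv260_som:part0:1.2 [current_project]
set_property board_connections \
    {som240_1_connector xilinx.com:kv260_carrier:som240_1_connector:1.2} [current_project]
# "Project is an extensible Vitis platform" maps to this project property:
set_property platform.extensible true [current_project]
```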
Once the new blank Vivado project has loaded, select the option Create Block Design from the Flow Navigator window. You can give the block design any desired name; I named mine kria_bd.
Start by adding the Zynq MPSoC IP block and running the subsequent block automation that appears after adding it to the diagram.
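If you're scripting the design instead, the block design creation and board preset steps look roughly like this (the zynq_ultra_ps_e IP version shown is what I'd expect in 2021.2, so treat it as an assumption):

```tcl
create_bd_design "kria_bd"
# Add the Zynq UltraScale+ MPSoC and apply the KV260 board preset,
# which is what the GUI's block automation does:
create_bd_cell -type ip -vlnv xilinx.com:ip:zynq_ultra_ps_e:3.3 zynq_ultra_ps_e_0
apply_bd_automation -rule xilinx.com:bd_rule:zynq_ultra_ps_e \
    -config {apply_board_preset "1"} [get_bd_cells zynq_ultra_ps_e_0]
```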
Every accelerated design needs the following hardware resources at minimum:
- Clock resources
- Interrupt(s)
- AXI interfaces to each desired peripheral such as DDR memory.
To cover the clocking resources, add a Clocking Wizard IP to the block design. Re-customize it (double-click on it in the diagram) to change the reset from active high to active low and have three output clocks of 100 MHz, 200 MHz, and 400 MHz.
Connect clk_in1 of the Clocking Wizard to pl_clk0 of the Zynq MPSoC, and resetn of the Clocking Wizard to pl_resetn0 of the Zynq MPSoC.
Manually add three Processor System Reset IP blocks to the design (one for each of the output clocks of the Clocking Wizard) and connect each clk_out from the Clocking Wizard to slowest_sync_clk of its own Processor System Reset. Then connect each ext_reset_in of the Processor System Reset IPs to pl_resetn0 of the Zynq MPSoC.
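Scripted, the clock/reset structure above looks roughly like the following sketch (IP versions and the clk_wiz CONFIG parameter names are assumptions based on the 2021.x IP catalog):

```tcl
# Clocking Wizard: active-low reset, 100/200/400 MHz output clocks
create_bd_cell -type ip -vlnv xilinx.com:ip:clk_wiz:6.0 clk_wiz_0
set_property -dict [list \
    CONFIG.RESET_TYPE {ACTIVE_LOW} CONFIG.RESET_PORT {resetn} \
    CONFIG.CLKOUT1_REQUESTED_OUT_FREQ {100} \
    CONFIG.CLKOUT2_USED {true} CONFIG.CLKOUT2_REQUESTED_OUT_FREQ {200} \
    CONFIG.CLKOUT3_USED {true} CONFIG.CLKOUT3_REQUESTED_OUT_FREQ {400}] \
    [get_bd_cells clk_wiz_0]
connect_bd_net [get_bd_pins zynq_ultra_ps_e_0/pl_clk0] [get_bd_pins clk_wiz_0/clk_in1]
# One Processor System Reset per output clock
foreach i {0 1 2} {
    set clkidx [expr {$i + 1}]
    create_bd_cell -type ip -vlnv xilinx.com:ip:proc_sys_reset:5.0 proc_sys_reset_$i
    connect_bd_net [get_bd_pins clk_wiz_0/clk_out$clkidx] \
        [get_bd_pins proc_sys_reset_$i/slowest_sync_clk]
}
# pl_resetn0 drives the Clocking Wizard reset and all three ext_reset_in pins
connect_bd_net [get_bd_pins zynq_ultra_ps_e_0/pl_resetn0] \
    [get_bd_pins clk_wiz_0/resetn] \
    [get_bd_pins proc_sys_reset_0/ext_reset_in] \
    [get_bd_pins proc_sys_reset_1/ext_reset_in] \
    [get_bd_pins proc_sys_reset_2/ext_reset_in]
```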
The following clock structure configuration is done to make the hardware design compatible with the reference accelerated applications provided by AMD-Xilinx:
Connect the 200 MHz clock (clk_out2 from the Clocking Wizard) to maxihpm0_lpd_aclk of the Zynq MPSoC. This means the PL design accessed by the accelerated kernel will be running at the clk_out2 rate of 200 MHz.
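In Tcl terms that's a single net connection (this extends the net that clk_out2 already drives):

```tcl
connect_bd_net [get_bd_pins clk_wiz_0/clk_out2] \
    [get_bd_pins zynq_ultra_ps_e_0/maxihpm0_lpd_aclk]
```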
Next, for the interrupt, add an AXI Interrupt Controller IP to the block design. Double-click on the IP after adding it to the design in order to bring up the window to customize its settings. Change the Interrupt Output Connection to Single then click OK to close it, and run the connection automation that appears for the AXI bus.
Connect the irq output from the AXI Interrupt Controller to the pl_ps_irq0[0:0] input of the Zynq MPSoC.
Then make sure clk_out2 and its peripheral_aresetn signal are the clock and reset feeding all of the master and slave interfaces on the AXI Interconnect (you may have to do this manually depending on the results of the connection automation).
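A rough Tcl sketch of the interrupt controller setup follows. I believe C_IRQ_CONNECTION is the parameter behind the Single setting, but treat it, and the choice of proc_sys_reset_1 as the reset source, as assumptions; the connection automation normally creates the AXI interconnect for the s_axi bus itself.

```tcl
create_bd_cell -type ip -vlnv xilinx.com:ip:axi_intc:4.1 axi_intc_0
# "Interrupt Output Connection = Single"
set_property CONFIG.C_IRQ_CONNECTION {1} [get_bd_cells axi_intc_0]
connect_bd_net [get_bd_pins axi_intc_0/irq] [get_bd_pins zynq_ultra_ps_e_0/pl_ps_irq0]
# Keep the controller in the 200 MHz clk_out2 domain
connect_bd_net [get_bd_pins clk_wiz_0/clk_out2] [get_bd_pins axi_intc_0/s_axi_aclk]
connect_bd_net [get_bd_pins proc_sys_reset_1/peripheral_aresetn] \
    [get_bd_pins axi_intc_0/s_axi_aresetn]
```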
You probably noticed the extra AXI peripheral in my block design in the above screenshots. This is optional, and up to you whether you want to implement it. I decided to leave it in for future use in accelerated application development; it is the HLS IP PMOD controller I documented in a past project here. Note that it is also clocked by clk_out2 from the Clocking Wizard.
Platform Setup
Once the appropriate IP blocks have been placed and configured in the block design, the specifics of what will be made available to the acceleration kernel (and the overall software) are specified in the Platform Setup tab within the Block Design.
Starting with the AXI Port settings tab, you'll see that the AXI ports you have the option to make available to the accelerated kernel are broken up into two major groups: the AXI ports on the Zynq MPSoC processing system itself and the AXI ports on the AXI Interconnect.
In this case with the Kria KV260, check the Enable box for all of the AXI ports on the Zynq MPSoC PS (zynq_ultra_ps_e_0) to be available to the accelerated kernel with the exception of the S_AXI_LPD port.
Set the Memport for M_AXI_HPM0_FPD and M_AXI_HPM1_FPD to M_AXI_GP and leave the SP Tag for them blank.
Set the Memport for S_AXI_HPC0_FPD, S_AXI_HPC1_FPD, S_AXI_HP0_FPD, S_AXI_HP1_FPD, S_AXI_HP2_FPD, & S_AXI_HP3_FPD to S_AXI_HP. Then give each its respective SP Tag; for example, set the SP Tag for S_AXI_HPC0_FPD to HPC0.
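Under the hood, the Platform Setup GUI writes PFM.* properties onto the block design cells. For the PS ports above, the equivalent Tcl would be something like:

```tcl
# Enable the PS AXI ports for the accelerated kernel (S_AXI_LPD excluded)
set_property PFM.AXI_PORT { \
    M_AXI_HPM0_FPD {memport "M_AXI_GP"} \
    M_AXI_HPM1_FPD {memport "M_AXI_GP"} \
    S_AXI_HPC0_FPD {memport "S_AXI_HP" sptag "HPC0"} \
    S_AXI_HPC1_FPD {memport "S_AXI_HP" sptag "HPC1"} \
    S_AXI_HP0_FPD  {memport "S_AXI_HP" sptag "HP0"} \
    S_AXI_HP1_FPD  {memport "S_AXI_HP" sptag "HP1"} \
    S_AXI_HP2_FPD  {memport "S_AXI_HP" sptag "HP2"} \
    S_AXI_HP3_FPD  {memport "S_AXI_HP" sptag "HP3"} \
} [get_bd_cells /zynq_ultra_ps_e_0]
```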
Then enable the desired number of AXI master interfaces on the AXI Interconnect for the accelerated kernel to use. I enabled 7 total (M02_AXI - M08_AXI, since I have two peripherals using M00_AXI and M01_AXI); the choice of 7 was to match the reference designs from AMD-Xilinx for the KV260. Leave the Memport for each at the default M_AXI_GP and the SP Tag blank.
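And similarly for the interconnect masters. Note that set_property replaces the whole PFM.AXI_PORT value on a cell, so all seven ports go in one call; the interconnect instance name /ps8_0_axi_periph is an assumption based on the default connection automation naming, so substitute yours.

```tcl
set_property PFM.AXI_PORT { \
    M02_AXI {memport "M_AXI_GP"} M03_AXI {memport "M_AXI_GP"} \
    M04_AXI {memport "M_AXI_GP"} M05_AXI {memport "M_AXI_GP"} \
    M06_AXI {memport "M_AXI_GP"} M07_AXI {memport "M_AXI_GP"} \
    M08_AXI {memport "M_AXI_GP"} \
} [get_bd_cells /ps8_0_axi_periph]
```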
Next, under the Clock settings tab, check the Enable box for the three output clocks from the Clocking Wizard (clk_out1 - clk_out3). Regardless of the number of clocks enabled, one must be designated as the default clock for the accelerated kernel. In this case, I'm using clk_out2 as the default since it's what the PL in the block design is being clocked by (so the Is Default is checked for clk_out2). Also, change the IDs of clk_out1, clk_out2, & clk_out3 to 0, 1, & 2 respectively.
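The equivalent PFM.CLOCK property would look like the sketch below, assuming the three Processor System Resets are named proc_sys_reset_0/1/2 to match clk_out1/2/3 as in the earlier sketches:

```tcl
set_property PFM.CLOCK { \
    clk_out1 {id "0" is_default "false" proc_sys_reset "/proc_sys_reset_0" status "fixed"} \
    clk_out2 {id "1" is_default "true"  proc_sys_reset "/proc_sys_reset_1" status "fixed"} \
    clk_out3 {id "2" is_default "false" proc_sys_reset "/proc_sys_reset_2" status "fixed"} \
} [get_bd_cells /clk_wiz_0]
```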
Under the Interrupt settings tab, check the Enable box for the intr output from the AXI Interrupt Controller.
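And the matching PFM.IRQ property (the range value reflects the 32 interrupt inputs a single AXI Interrupt Controller can concatenate):

```tcl
set_property PFM.IRQ {intr {id 0 range 32}} [get_bd_cells /axi_intc_0]
```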
At this point, the hooks are in place for the accelerated kernel to access the three output clocks from the Clocking Wizard, the interrupt from the AXI Interrupt Controller, 7 general purpose master AXI interfaces, and the DDR memory on the board via the high performance AXI slave ports (S_AXI_HP*) of the Zynq MPSoC. So to wrap up this base hardware design, validate and save the block design.
Validate by clicking the check box icon at the top of the block design window, then the save icon along the menu bar after the design reports back from validation.
The critical warning that the intr[0:0] input of the AXI Interrupt Controller is not connected can be safely ignored for now.
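From the Tcl console, the same validate/save step is:

```tcl
# Expect the critical warning about the unconnected intr[0:0] input;
# the kernel interrupts get wired to it later in the acceleration flow.
validate_bd_design
save_bd_design
```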
Create HDL Wrapper
In the Sources tab window, right-click on the block design file and select the option Create HDL Wrapper...
Then in the window that appears, select the option to allow Vivado to auto-manage the HDL wrapper file.
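Scripted, wrapper creation looks like the following; the .gen output path assumes the default project layout of Vivado 2020.2+ along with my project and block design names, so adjust accordingly:

```tcl
make_wrapper -files [get_files kria_bd.bd] -top
add_files -norecurse \
    [get_property DIRECTORY [current_project]]/kv260_accel.gen/sources_1/bd/kria_bd/hdl/kria_bd_wrapper.v
set_property top kria_bd_wrapper [current_fileset]
```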
Next, click Generate Block Design in the Flow Navigator window. Change the Synthesis Options from Out of context per IP to Global then click Generate.
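The Global synthesis option corresponds to turning off the per-IP out-of-context checkpoints before generating the design's output products:

```tcl
set_property synth_checkpoint_mode None [get_files kria_bd.bd]
generate_target all [get_files kria_bd.bd]
```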
After some trial and error, I found that the design checkpoint file automatically created by the incremental synthesis method in Vivado causes issues in the Vitis platform and prevents the v++ compiler from building the accelerated kernel. So before running synthesis, implementation, and generating a bitstream, incremental synthesis needs to be disabled.
Open the Settings window from the Flow Navigator and select the Synthesis tab under Project Settings. Click the ... next to Incremental synthesis to bring up its settings window.
Select the option to Disable incremental synthesis and click OK to return to the main Settings window. Then click OK on the Settings window to close it.
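Or equivalently from the Tcl console:

```tcl
# Stop synth_1 from writing the automatic incremental checkpoint
set_property AUTO_INCREMENTAL_CHECKPOINT 0 [get_runs synth_1]
```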
Run synthesis, implementation, and generate a bitstream for the design. If you select Generate Bitstream from the Flow Navigator window, Vivado will automatically run synthesis and implementation as needed to save you a few button clicks.
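In Tcl, launching implementation through bitstream generation pulls in synthesis automatically as well:

```tcl
launch_runs impl_1 -to_step write_bitstream -jobs 4
wait_on_run impl_1
```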
Export Platform
Once the bitstream has been successfully generated, the platform needs to be exported in Xilinx's hardware archive file format (.xsa) so it can be pulled into PetaLinux and the Vitis software development IDE to generate the accelerated application software.
Click File > Export > Export Platform or select Export Platform from the Flow Navigator window. This brings up the Export Hardware Platform wizard.
Click Next on the first information page then select Hardware on the second page before clicking Next again.
Select Pre-synthesis for the Platform State and check the box to Include bitstream in the exported platform then click Next.
Give the platform the desired Name, Vendor, Board, Version, and Description information and click Next.
Name the XSA file kv260_custom_platform, leave the export directory as the default option, and click Finish.
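The wizard ultimately calls write_hw_platform; a minimal Tcl equivalent is below (the name/vendor/version metadata from the previous page is handled by the GUI, so this sketch only covers writing the file itself):

```tcl
# -include_bit matches the "Include bitstream" checkbox in the wizard
write_hw_platform -force -include_bit ./kv260_custom_platform.xsa
```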
This completes the hardware step of developing an accelerated application on the Kria KV260. Next up is the first step of adding the software for the accelerated kernel itself: enabling XRT in the embedded Linux image using PetaLinux.