----------------------------------------------------------------------------------------------------------
Link
Schematic: https://www.avnet.com/opasdata/d120001/medias/docus/193/Ultra96-V2%20Rev1%20Schematic.pdf
Board files: https://github.com/Avnet/bdf (Avnet Board Definition Files)
----------------------------------------------------------------------------------------------------------
The objective of this module is to integrate an acceleration platform overlay into the PYNQ framework. The hardware design (Vivado) provides the connections and hooks, and the software flow (Vitis) completes those connections into a full design. The result is then transported to PYNQ.
The integration is in two parts. The first step is setting up the hardware to instantiate the hooks for the acceleration platform (Vivado sees this design as incomplete - that is normal). Think of it as a foundation, or as a wall socket waiting for a plug.
With those connections created, the second part is the software attaching to the hardware foundation: the application plugs into the hooks (the wall socket). The software flow completes the hardware design through automated processes, starting from a template that acts as a placeholder for your application.
Once we have a completed design with both the hardware in Vivado and the software in Vitis, we migrate the project as hwh/tcl/bit files. We can then interface with the instantiated block/template and use its function. In this case it is an addition function: A + B = C.
HW - Creating Block Diagram
When creating the hardware module, we will leverage an existing project flow which explains in detail how to set up our hardware so the software can connect to it. Refer to the sections accordingly.
At a high level we are accomplishing a vector addition algorithm: A + B = C
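As a point of reference before diving into the tools, here is a minimal software-only sketch (plain NumPy; the example values are the same 4-element buffers used in the PYNQ notebook later) of the behaviour we expect the accelerator to reproduce:
import numpy as np
# Software-only reference of the vector addition the accelerator will perform.
A = np.array([0, 1, 2, 3], dtype=np.uint32)
B = np.array([1, 2, 3, 4], dtype=np.uint32)
C = A + B
print(C)  # expected: [1 3 5 7]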
The flow diagram runs through the hardware process chronologically. This allows the PS to interface with the PL nodes we have placed, with five different resets that we can hook up.
Requirements for making an Acceleration Platform - NOTE - Follow the specific instructions. You must have the following in the design or the acceleration platform will not operate correctly:
- Memory & Controller Interface - MPSoC
- Clock
- Interrupt
We build the Vivado platform, then create our rootfs, FSBL, PMU firmware, U-Boot, sysroot, system, and DTB. These are then pathed and registered in the Vitis platform project so a new application with custom code can run on the MPSoC build produced by PetaLinux. The flow below describes the process in simple terms.
To find out more about PetaLinux and Vitis, refer to the resources (UG1144 | UG1393)
------------------------------------------------------------------------------------------------------
Follow Adam Taylor's flow (https://www.hackster.io/adam-taylor)
Creating Vivado XSA
Once in the Vivado directory, we can start Vivado and create a new project; make sure we target the Ultra96 V2 board.
With the project open, the next step is to create the block diagram.
Project Open and Ready for Creation
When the create block diagram dialog appears, leave the name as design_1 and click OK.
Creating the block diagram
In the block diagram, we need to add the MPSoC processing system and configure it for the Ultra96 V2 by running the block automation.
Adding in the MPSoC IP Core
Running the block automation
Now that we have the processing system configured for a base Ultra96 V2 project, we can configure it as needed for an acceleration XSA by re-customizing the MPSoC IP.
The first step is to turn off the AXI HPM0 and HPM1 FPD interfaces.
Disabling the interfaces on the Ultra96 default settings
With the MPSoC configured as we desire, the next step is to implement the clocking and reset structures.
Let's start by adding in a Clocking Wizard IP block and re-configuring it to provide five outputs, increasing in frequency from 100 MHz to 400 MHz, with the fifth clock output at 600 MHz.
Setting clock frequencies
To ensure the reset on the clocking wizard is compatible with the MPSoC IP block reset, we need to set the clocking wizard reset to be active low.
Setting the clock wizard reset to be active low
To function correctly in the Vitis acceleration flow, each clock needs an associated reset. So now let's address this.
Adding in the reset
Next, add in a Processor System Reset IP; we need one of these for each of the clock outputs on the clocking wizard. Copy and paste it four times so we have a total of five in the block diagram.
At this stage our block diagram should look as below:
All IP blocks within the block diagram
Run the connection automation and associate each processor system reset block's slowest sync clock with one of the clocking wizard clocks.
Setting the slowest clocks in the run connection automation
Set each of the ext_reset_in signals to the pl_resetn0 output from the MPSoC.
setting the connection automation configuration
Once the automation has completed, the final stage is to connect the dcm_locked inputs on the processor reset systems to the locked output on the clocking wizard.
Finally, add a Concat IP block and connect it to the pl_ps_irq0 input on the MPSoC. Ensure there is only one input on the Concat block.
Completed block diagram
Having completed the base platform, the next stage is to declare the capabilities which will (or will not) be made available to the v++ compiler.
To do this, first we need to enable the platform view — this is enabled by selecting:
Window -> Platform Interfaces
Opening the platform interface
This will create a new Platform Interfaces window within the block diagram.
Click on the Enable Platform Interfaces, and you will see a list of the available interfaces under each of the elements in the design.
These can be enabled or disabled by right clicking on the interface. For the Ultra96 V2 MPSoC, ensure the interfaces below are enabled.
Enabling platform interfaces
Select clk_out3 and in the options below set the ID to 0 and make it the default clock.
Setting the clock 3 as the default
Finally, enable In0 to in7 on the Concat block.
We also need to set the design intent to show where we are intending to deploy the design. Enter the commands below in the TCL console.
set_property platform.design_intent.embedded true [current_project]
set_property platform.design_intent.server_managed false [current_project]
set_property platform.design_intent.external_host false [current_project]
set_property platform.design_intent.datacenter false [current_project]
set_property platform.default_output_type "sd_card" [current_project]
Save the project, validate the block diagram, and generate an HDL wrapper for the block diagram.
Successful validation
Once we have the HDL wrapper, implement the design and generate the bitstream.
Wait until the bitstream is available, then export and validate the XSA with the following TCL commands:
write_hw_platform -include_bit ultra96_min.xsa
validate_hw_platform ./ultra96_min.xsa
Validation of the XSA
Now that we have the hardware element of the platform created, we can start looking at the software element next.
------------------------------------------------------------------------------------------------------
SW - Building PetaLinux (Follow Adam Taylor's flow)
Before we can install PetaLinux, we need to ensure we have the necessary prerequisites. For the VM we created a few weeks ago, we can install them using the command:
sudo apt-get install -y gcc git make net-tools libncurses5-dev tftpd zlib1g-dev libssl-dev flex bison libselinux1 gnupg wget diffstat chrpath socat xterm autoconf libtool tar unzip texinfo zlib1g-dev gcc-multilib build-essential libsdl1.2-dev libglib2.0-dev zlib1g:i386 screen pax gzip gawk
Once the required packages have been installed, we can download and install PetaLinux.
Installing PetaLinux
Completion of installation
With PetaLinux available, under the directory we created for the platform previously we need to create three new directories: pfm, wksp1, and boot.
cd ultra96_min_pkg
mkdir pfm
cd pfm
mkdir wksp1
mkdir boot
cd ..
In the same terminal window, we need to source the following files:
- /settings64.sh
- /settings.sh
We are now ready to create the PetaLinux project. Make sure the project name is the same as the hardware project, in this case ultra96_min.
Creating the new PetaLinux project
petalinux-create -t project --template zynqMP -n ultra96_min
We then need to configure the new project for the hardware design using the command:
cd ultra96_min
petalinux-config --get-hw-description=../vivado
This will open a configuration dialog; set the boot args to
earlycon clk_ignore_unused root=/dev/ram rw
and stdin/stdout to psu_uart_1.
Configuring the hardware
Setting stdin / stdout
Once this is completed, save the changes and exit the dialog.
We need to make some changes to the meta-user Yocto layer, under the directory:
/project-spec/meta-user
Open the conf file (user-rootfsconfig) and add in the required OpenCL package entries:
CONFIG_xrt
CONFIG_xrt-dev
CONFIG_zocl
CONFIG_opencl-clhpp-dev
CONFIG_opencl-headers-dev
CONFIG_packagegroup-petalinux-opencv
Editing the conf file
We also need to make changes to the user device tree, under:
/project-spec/recipes-bsp/device-tree/files
Edit the file system-user.dtsi and add in the following:
/include/ "system-conf.dtsi"
/ {
amba {
mmc@ff160000 {
u-boot,dm-pre-reloc;
compatible = "xlnx,zynqmp-8.9a", "arasan,sdhci-8.9a";
status = "okay";
interrupt-parent = <0x4>;
interrupts = <0x0 0x30 0x4>;
reg = <0x0 0xff160000 0x0 0x1000>;
clock-names = "clk_xin", "clk_ahb";
xlnx,device_id = <0x0>;
#stream-id-cells = <0x1>;
iommus = <0xd 0x870>;
power-domains = <0xc 0x27>;
clocks = <0x3 0x36 0x3 0x1f>;
clock-frequency = <0xb2d0529>;
xlnx,mio_bank = <0x0>;
no-1-8-v;
disable-wp;
};
};
};
&amba {
zyxclmm_drm {
compatible = "xlnx,zocl";
status = "okay";
reg = <0x0 0xA0000000 0x0 0x10000>;
};
};
Updating the device tree
Once these edits have been made, open the rootfs configuration and enable the user packages.
petalinux-config -c rootfs
Enabling the user packages
To be able to use the platform for acceleration, we need to make a few changes to the kernel configuration:
petalinux-config -c kernel
Make the following changes:
- Device Drivers -> Generic Driver Options -> DMA Contiguous Memory Allocator -> Size in Mega Bytes: change the size from 256 to 1024 MB
- Device Drivers -> Staging drivers -> Xilinx APF Accelerator driver
- Device Drivers -> Staging drivers -> Xilinx APF Accelerator driver -> Xilinx APF DMA engines support
Once this has been completed, we are then ready to build the PetaLinux image.
petalinux-build
Building the PetaLinux Image
Wait until the build completes; we can then create the sysroot. Change directory into the images/linux directory and run:
petalinux-build --sdk
We will use this to install the sysroot in the pfm directory. Run the command:
./sdk.sh
When prompted, enter the full path to the pfm directory.
Sysroot installer
Finally, copy the following files from the /images/linux directory to the boot directory.
- image.ub
- zynqmp_fsbl.elf
- pmufw.elf
- bl31.elf
- u-boot.elf
Copying the boot files
We also need to create a linux.bif file under the boot directory; this should contain the following.
/* linux */
the_ROM_image:
{
	[fsbl_config] a53_x64
	[bootloader] <zynqmp_fsbl.elf>
	[pmufw_image] <pmufw.elf>
	[destination_device=pl] <bitstream>
	[destination_cpu=a53-0, exception_level=el-3, trustzone] <bl31.elf>
	[destination_cpu=a53-0, exception_level=el-2] <u-boot.elf>
}
------------------------------------------------------------------------------------------------------
Creating Vitis Platform Project
With all of this completed, we are now in a position to open Vitis and begin to create the platform. From within the pfm directory, run the following command:
vitis -workspace wksp1
This will open the Vitis GUI. From under the project column, select Create Platform Project.
Vitis welcome screen
This will open a new platform project dialog, enter the project name and click next.
Dialog for project creation
On the next dialog, select the create from hardware specification.
Selecting the platform definition source
On the next dialog, select the XSA which is under the Vivado directory.
Selecting the XSA
Select the operating system as Linux and the processor as psu_cortexa53.
Defining the SW solution
Completing the dialog will open a platform project in Vitis. To be able to build the application, we need to provide the location of the BIF, boot directory, Linux image and sysroot — all of which are available under the PFM directory.
With the information provided, we can build the platform project. This might take a minute or two.
Selecting the new platform
For the application, select the demo example, change the target to hardware, and build the application.
This took about 30 minutes on my system.
Acceleration application build completed
We now have a platform which we can use to accelerate our OpenCL applications on for the Ultra96 V2.
------------------------------------------------------------------------------------------------------
SW - Application Addition Vector
Create a new application project and select the new platform that was built in the last step. Now that the hardware foundation has been constructed, we have to hook in our kernel linker and edit the source code.
Getting started - Welcome Screen
Select your exported XSA acceleration platform
Select the template - Vector Addition Example
Edit the template so that you can strip out the content you do not need, and change the pointers to integers.
Now it is time to link the Vector Add kernel into the project. The Compute Units field next to the name is how many copies of the kernel you want to drop in. Since we are only using one kernel, we can leave Compute Units = 1.
Configuration time! Here is where things can become messy. We now have to create a .cfg file with the system link options so the compiler can stitch in the new kernel that has been brought in.
Create a new file --> vector_add-link.cfg
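The exact contents depend on your design, but as a minimal sketch (assuming the template kernel is named krnl_vadd and we want a single instance called krnl_vadd_1, which matches the register map we see later in PYNQ), vector_add-link.cfg could contain:
[connectivity]
# one compute unit of the vadd kernel, instance name krnl_vadd_1
nk=krnl_vadd:1:krnl_vadd_1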
Now, when the v++ compiler runs, we can see this kernel being brought into our design.
Build the project - this may take a while. Give it 10-25 minutes depending on the machine you are working on. NOTE - Let it think if it appears to get "stuck".
Files that are generated:
- TCL File
- HWH File
- Bit File
- XCLBIN File
Find the Link_Summary and double click it to verify that your hardware is behaving as you expect.
At a high level, this is what we expected! There are 2 inputs and 1 output, matching the source code. We can now go further and open Vivado, which is living under the hood in Vitis. Let me show you how to do this.
Once you see design_1_bd.tcl, double click it! This will open a new Vivado window which shows you the complete platform built from the combination of hardware and software. The software automates this step.
This is awesome! The best part is that you configured your project such that the tool could work its magic.
Vitis to PYNQ
Since we have a working project and we can see that the correct files exist, let's use WinSCP to copy the files over to the Ultra96.
Now let's transfer the files you need - here are their locations:
TCL File
- vitis_accel_add_sw_v3/U96_platform_system_hw_link/Hardware/binary_container_1.build/link/vivado/vpl/prj/prj.runs/impl_1
HWH File
- vitis_accel_add_sw_v3/U96_platform_system_hw_link/Hardware/binary_container_1.build/link/vivado/vpl/prj/prj.runs/impl_1
Bit File
- vitis_accel_add_sw_v3/U96_platform_system_hw_link/Hardware/binary_container_1.build/link/vivado/vpl/prj/prj.runs/impl_1
XCLBIN File
- vitis_accel_add_sw_v3/U96_accel_app_system/Hardware/
We are accomplishing the following: A + B = C. With PYNQ we have to allocate buffers for the kernel's inputs and output; here we create simple arrays of 4 data points each. We expect the kernel to apply the add function to the two input arrays A and B and produce the output array C.
SW Code
from pynq import Overlay
import pynq
import time
from pynq import allocate
from pynq.lib.dma import DMA
import numpy as np
# from PIL import Image
from IPython.display import Image
import matplotlib.pyplot as plt
# Import libs
from pynq import Overlay
from pynq import Device
from pynq import DefaultIP
for i in range(len(pynq.Device.devices)):
    print("{}) {}".format(i, pynq.Device.devices[i].name))
0) Ultra96
1) Ultra96
overlay_1 = Overlay("bitfile_hwh_files/vitis_accel_add.xclbin", device=pynq.Device.devices[1])
overlay_1?
#-----------------------
#Bring the bit file in [Do Not NEED]
#-----------------------
# overlay_1 = pynq.Overlay('my_overlay1.bit', device=pynq.Device.devices[0])
# overlay_1 = Overlay("bitfile_hwh_files/test_2.bit")
# overlay_1?
# overlay_2 = Overlay("bitfile_hwh_files/test_3.bit")
# help(overlay_2)
# overlay_3 = Overlay("bitfile_hwh_files/vitis_accel_add.bit")
# overlay_3?
# overlay = Overlay("test_2.xclbin", device=pynq.Device.devices[1])
# overlay?
# Creating a driver for the new kernel function vadd
# Verify you can run this IP
add_ip = overlay_1.krnl_vadd_1
# add_ip?
add_ip.register_map
RegisterMap {
CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, AUTO_RESTART=0, AP_CONTINUE=0),
in1 = Register(value=1263337472),
in2 = Register(value=1924894720),
out_r = Register(value=967557120),
size = Register(value=4)
}
Load in the allocation lib
# Running xrt accelerator code
import pynq
from pynq import allocate
# INITIALIZE - Allocate input arrays of 32-bit unsigned integers (4 elements each)
input_buf_1 = pynq.allocate(shape=(4,), dtype='u4')
input_buf_2 = pynq.allocate(shape=(4,), dtype='u4')
#output
outbuf_1 = pynq.allocate(shape=(4,), dtype='u4')
# outbuf_2 = pynq.allocate(shape=(4,), dtype='u4')
#initialize the buffers
for i in range(4):
    input_buf_1[i] = i
    input_buf_2[i] = i + 1
# Write to size
# add_ip.register_map.size = 4
print(input_buf_1)
print(input_buf_2)
[0 1 2 3]
[1 2 3 4]
add_ip.register_map
RegisterMap {
CTRL = Register(AP_START=0, AP_DONE=0, AP_IDLE=1, AP_READY=0, AUTO_RESTART=0, AP_CONTINUE=0),
in1 = Register(value=1263337472),
in2 = Register(value=1924894720),
out_r = Register(value=967557120),
size = Register(value=4)
}
Sync to Device & Start Kernel Call#XRT Framework
input_buf_1.sync_to_device()
input_buf_2.sync_to_device()
# Input 1 | input 2 | output | size
# .call() runs the kernel and blocks until it completes
overlay_1.krnl_vadd_1.call(input_buf_1, input_buf_2, outbuf_1, 4)
# Send the data XRT - START: .start() is the non-blocking alternative and returns a handle we can wait on
handle = overlay_1.krnl_vadd_1.start(input_buf_1, input_buf_2, outbuf_1, 4)
handle.wait()
outbuf_1.sync_from_device()
outbuf_1
PynqBuffer([1, 3, 5, 7], dtype=uint32)
Output Vitis Acceleration Overlay
Array_A = [0, 1, 2, 3]
Array_B = [ 1, 2, 3, 4]
Addition_Array_C= [1, 3, 5, 7]
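As a quick sanity check back in the notebook (a sketch assuming the buffers from the cells above are still allocated), we can compare the hardware result against a plain NumPy addition:
import numpy as np
# Compare the accelerator output with a software reference computation.
expected = np.asarray(input_buf_1) + np.asarray(input_buf_2)
print("Expected:", expected)                                  # [1 3 5 7]
print("Match:", np.array_equal(np.asarray(outbuf_1), expected))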