Published September 2, 2022 © GPL3+

Debugging Accelerated Apps using Emulation in Vitis 2021.2

This project walks through how to set up a generic project in Vitis 2021.2 to debug FPGA accelerated applications using SW & HW emulation.

Advanced | Full instructions provided | 3 hours | 1,616

Things used in this project

Hardware components

Zynq UltraScale+ MPSoC ZCU104

Software apps and online services

AMD Vitis Unified Software Platform

AMD PetaLinux

AMD Vivado Design Suite

Story

As I've been diving into the world of accelerated applications on AMD-Xilinx FPGAs, I've found emulation of my accelerated applications to be an invaluable tool.

The only catch is that, as far as I've found, it's not possible to emulate Linux applications in Vitis for boards that boot from QSPI, like the Kria KV260/KR260 and the Zynqberry. Boards that don't have QEMU support also cannot have Linux-based applications emulated in Vitis, since QEMU is what the Vitis emulator uses to run Linux.

Thus, I've created this project on the ZCU104 (I don't currently own a ZCU104, but I'm emulating so I don't need it!) to be able to emulate some of the elements of the accelerated application that I am developing for my Kria boards. I selected the ZCU104 since it is Zynq UltraScale+ MPSoC based, has good QEMU support, and doesn't require an additional Vivado license.

Note: I am using version 2021.2 in this project post, so these steps are only directly applicable to 2021.1 and 2021.2 (meaning I can't guarantee the following steps will work in any versions besides 2021.x). I know that the 2022.1 workflow is a bit different, and I will post an updated version of this when I get it figured out.

Hardware Design in Vivado

For the hardware design, I started with the same base I have used for my previous accelerated designs on the Kria in terms of clocking resources, interrupt controller, and available AXI interfaces (see exact details here; the setup of the block design is exactly the same between the Kria and the ZCU104 in this case).

A quick rundown of the block design: start by adding the Zynq UltraScale+ Processing System IP block and run the corresponding block automation to apply the board presets. Then add a clocking wizard IP with the desired number of reference clocks for the accelerated kernel to use, each with its own processor system reset IP.

Add an AXI Interrupt Controller and select the clock you plan to set as the default for the accelerated kernel in the Platform Setup tab to drive it in the connection automation tool.

Since DDS compilers are a common IP of choice for me, I decided to test out what driving one from the Vitis accelerated kernel was like. It's also a good benchmark since I've posted two previous projects highlighting the difference between driving one from the programmable logic and driving one from embedded C in a bare-metal application running on an ARM core of the Zynq.

So I added a DDS compiler IP to the block design. Since I plan to have the accelerated kernel driving both the phase input and reading out the data waveform, the only port connected in the block design is the clock port (which again is the clock set as the default for the accelerated kernel in the Platform Setup tab).

For the DDS compiler IP settings themselves, I set it to stream the input phase increment value with no offset option. The output is just the data value of a cosine waveform with no packet framing on the AXI Stream interface.
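For reference, the value streamed into the phase increment port follows the usual DDS relation: increment = f_out / f_clk × 2^B, where B is the phase width in bits. A minimal sketch of that arithmetic is below. The function name is made up for this example, and the 400 MHz clock and 31-bit phase width are assumptions chosen for illustration rather than settings confirmed in this post, so substitute the values from your own DDS compiler configuration.

```shell
# Phase increment for a DDS: round(f_out / f_clk * 2^bits), in integer math.
# All three arguments are illustrative assumptions; use your own DDS settings.
dds_phase_increment() {
    # $1 = desired output frequency in Hz
    # $2 = DDS clock frequency in Hz
    # $3 = phase width in bits
    echo $(( ($1 * (1 << $3) + $2 / 2) / $2 ))
}

# Assumed example: 1 MHz output from a 400 MHz clock with a 31-bit phase width.
inc=$(dds_phase_increment 1000000 400000000 31)
printf 'phase increment = %d (0x%08X)\n' "$inc" "$inc"
```

With those assumed inputs the result is 5368709 (0x0051EB85), which is the same constant the attached vadd.cpp writes for its 1 MHz tone. Other clock and width pairs, such as 200 MHz with 30 bits, produce the same number, so the match alone doesn't pin down the actual settings.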

Configure the Platform Setup tab to enable the desired AXI ports, clocks, and interrupt for the accelerated kernel to have access to.

For the AXI Stream Port tab, enable the master and slave ports of the DDS compiler and give them the appropriate SP Tag names. I personally found using the port names of the IP block in the block design the most helpful since the SP Tag is used as a reference designator in Vitis by the kernel.

Enable the desired clocks for the accelerated kernel to use and set the default clock.

Then enable the interrupt from the AXI interrupt controller:

Then give the platform the desired name, board info, etc.:

Validate the block design, save it, and then generate the block design (select Generate Block Design from the Flow Navigator window). Create an HDL wrapper then run synthesis, implementation, and generate a bitstream for the design.

Export Platform

Export the platform by selecting the Export Platform option from the Flow Navigator window. Be sure to select an option that includes hardware emulation. I always choose the option that includes both hardware and emulation so I don't have to bother with re-exporting if I do decide to implement the design on hardware.

Check the option to include the bitstream in the exported platform.

The Platform Properties can be left as their default values.

The platform can be exported to any desired directory location, but I like to use the default location in the top level of the Vivado project directory. I also am sure to use the platform name for the XSA file name.

The top level of the Vivado project directory is also where I like to keep everything else for the accelerated design in the following steps including the PetaLinux project, Linux components directory, and Vitis workspace.

Create PetaLinux Project

The emulation I plan to use in Vitis emulates a Linux OS on the target device (since the end goal is a Linux application), so the proper Linux components need to be generated by building a PetaLinux project before getting into Vitis.

Source the PetaLinux tools and create a project targeting the appropriate Zynq platform (the ZynqMP since I'm using the ZCU104) in the desired directory (top level of the Vivado project directory in my case as mentioned previously which is ~/zcu104_prj). Then change directories into the newly created PetaLinux project.

~$ source /tools/Xilinx/PetaLinux/2021.2/settings.sh
~$ cd ./zcu104_prj/
~/zcu104_prj$ petalinux-create --type project --template zynqMP --name linux_os
~/zcu104_prj$ cd ./linux_os/

Import the hardware platform exported from Vivado (again, I exported the XSA platform file to the default location in the Vivado project top level directory):

~/zcu104_prj/linux_os$ petalinux-config --get-hw-description ../

Leave all of the hardware system settings set to their defaults, but if you're also using the ZCU104, update the machine name under DTG Settings to zcu104-revc.

DTG Settings --> (zcu104-revc) MACHINE_NAME

Exit the system configuration editor and opt to save the changes.

Enable XRT in Root Filesystem

A few package dependencies need to be added to the root filesystem to support emulation such as XRT and ZOCL.

Launch the root filesystem configuration editor:

~/zcu104_prj/linux_os$ petalinux-config -c rootfs

Enable XRT, ZOCL, and OpenCL headers under filesystem packages as shown below. Also enable the general PetaLinux package group, as well as the networking stack, OpenCV, and Utilities package groups. If there are any other package dependencies relevant to your accelerated application, enable them as well.

Filesystem Packages
--> libs
----> xrt
------ [*] xrt
------ [*] xrt-dev
----> zocl
------ [*] zocl
----> opencl-clhpp
------ [*] opencl-clhpp-dev
----> opencl-headers
------ [*] opencl-headers

Petalinux Package Groups 
--> packagegroup-petalinux
---- [*] packagegroup-petalinux 
--> packagegroup-petalinux-networking-stack
---- [*] packagegroup-petalinux-networking-stack 
--> packagegroup-petalinux-opencv
---- [*] packagegroup-petalinux-opencv 
--> packagegroup-petalinux-utils
---- [*] packagegroup-petalinux-utils
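
After saving and exiting the editor, the selections can be double-checked from the command line. The sketch below assumes that enabled packages are recorded as CONFIG_<package>=y lines in ./project-spec/configs/rootfs_config; both the file location and the symbol naming are assumptions about the PetaLinux project layout, and the function name is made up for this example, so adjust them if your project differs.

```shell
# Report any rootfs package that is not enabled in a PetaLinux rootfs config.
# Assumption: enabled packages appear as lines of the form CONFIG_<name>=y.
check_rootfs_packages() {
    # $1 = path to the rootfs config file; remaining arguments = package names
    cfg=$1
    shift
    missing=0
    for pkg in "$@"; do
        if grep -q -x -F "CONFIG_${pkg}=y" "$cfg"; then
            echo "enabled: $pkg"
        else
            echo "MISSING: $pkg"
            missing=1
        fi
    done
    return "$missing"
}
```

From the PetaLinux project directory this could be called as check_rootfs_packages ./project-spec/configs/rootfs_config xrt xrt-dev zocl opencl-clhpp-dev opencl-headers, and it returns a non-zero status if any listed package is not enabled.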

The ZOCL driver node also needs to be added to the device tree, so open system-user.dtsi with your preferred text editor:

~/zcu104_prj/linux_os$ gedit ./project-spec/meta-user/recipes-bsp/device-tree/files/system-user.dtsi

And add the following node along with any others relevant to your design:

/include/ "system-conf.dtsi"
/{
};

&axi_intc_0 {
	xlnx,kind-of-intr = <0x0>;
	xlnx,num-intr-inputs = <0x20>;
};


&amba {	
	zyxclmm_drm {
		compatible = "xlnx,zocl";
		status = "okay";
		interrupt-parent = <&axi_intc_0>;
		interrupts = <0  4>, <1  4>, <2  4>, <3  4>,
			     <4  4>, <5  4>, <6  4>, <7  4>,
			     <8  4>, <9  4>, <10 4>, <11 4>,
			     <12 4>, <13 4>, <14 4>, <15 4>,
			     <16 4>, <17 4>, <18 4>, <19 4>,
			     <20 4>, <21 4>, <22 4>, <23 4>,
			     <24 4>, <25 4>, <26 4>, <27 4>,
			     <28 4>, <29 4>, <30 4>, <31 4>;
	};
};

Save & close system-user.dtsi.
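
The 32 entries of the interrupts property follow a simple pattern: input numbers 0 through 31 (matching xlnx,num-intr-inputs = <0x20>), each paired with the same second cell of 4. If you'd rather generate the list than type it, a small sketch like the following prints it on one line for pasting into the node. The function name is made up for this example.

```shell
# Print a device tree "interrupts" property with N entries of the form <i 4>.
gen_zocl_interrupts() {
    # $1 = number of interrupt inputs (32 in the node above)
    n=$1
    i=0
    printf 'interrupts = '
    while [ "$i" -lt "$n" ]; do
        printf '<%d 4>' "$i"
        i=$((i + 1))
        # Separate entries with a comma, except after the last one
        if [ "$i" -lt "$n" ]; then
            printf ', '
        fi
    done
    printf ';\n'
}

gen_zocl_interrupts 32
```

If you change the number of interrupt inputs on the AXI Interrupt Controller, pass the new count and update xlnx,num-intr-inputs to match.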

Build PetaLinux Project with SDK

Build the PetaLinux project to generate the components needed by QEMU in Vitis:

~/zcu104_prj/linux_os$ petalinux-build

Also build an SDK for the PetaLinux project to get a sysroot for compiling the accelerated application in Vitis:

~/zcu104_prj/linux_os$ petalinux-build --sdk

Create Linux Components Directory

The last step before jumping into Vitis is to create a directory structure to organize the necessary Linux components into.

Create a new directory in the desired location (top level Vivado project directory again for me), with boot and image directories under it:

~/zcu104_prj$ mkdir -p linux_components
~/zcu104_prj$ cd ./linux_components
~/zcu104_prj/linux_components$ mkdir -p ./src/boot
~/zcu104_prj/linux_components$ mkdir -p ./src/image

Copy bl31.elf, zynqmp_fsbl.elf, pmufw.elf, u-boot.elf, and system.dtb to the boot directory from the PetaLinux project (./linux_os/images/linux). Then rename zynqmp_fsbl.elf to fsbl.elf:

~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/bl31.elf ./src/boot
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/zynqmp_fsbl.elf ./src/boot 
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/pmufw.elf ./src/boot
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/u-boot.elf ./src/boot
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/system.dtb ./src/boot
~/zcu104_prj/linux_components$ cd ./src/boot
~/zcu104_prj/linux_components/src/boot$ mv zynqmp_fsbl.elf fsbl.elf

Then copy boot.scr, image.ub, and rootfs.cpio.gz to the image directory from the PetaLinux project (./linux_os/images/linux).

~/zcu104_prj/linux_components/src/boot$ cd ../../
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/boot.scr ./src/image
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/image.ub ./src/image
~/zcu104_prj/linux_components$ cp ../linux_os/images/linux/rootfs.cpio.gz ./src/image
~/zcu104_prj/linux_components$ cd ./src/image
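
Since these copies need to be repeated every time the PetaLinux project is rebuilt, they can be wrapped in a small helper. The following is only a sketch of the manual steps above: the function name is made up for this example, and it assumes the same file names and directory layout used in this post.

```shell
# Copy the boot and image files from a PetaLinux images/linux directory into
# the linux_components layout, stopping at the first file that is missing.
stage_linux_components() {
    # $1 = PetaLinux images/linux directory, $2 = linux_components directory
    src=$1
    dst=$2
    mkdir -p "$dst/src/boot" "$dst/src/image" || return 1
    for f in bl31.elf zynqmp_fsbl.elf pmufw.elf u-boot.elf system.dtb; do
        cp "$src/$f" "$dst/src/boot/" || { echo "missing: $src/$f" >&2; return 1; }
    done
    for f in boot.scr image.ub rootfs.cpio.gz; do
        cp "$src/$f" "$dst/src/image/" || { echo "missing: $src/$f" >&2; return 1; }
    done
    # The FSBL is renamed to fsbl.elf, as in the manual steps
    mv "$dst/src/boot/zynqmp_fsbl.elf" "$dst/src/boot/fsbl.elf"
}
```

With the paths from this project, the call would be stage_linux_components ~/zcu104_prj/linux_os/images/linux ~/zcu104_prj/linux_components.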

Next, an initialization script needs to be created to tell the system (the Linux system being emulated) what hardware platform to load at boot up. Create an init.sh script in the image directory using your text editor of choice:

~/zcu104_prj/linux_components/src/image$ gedit init.sh

And copy the following into it:

cp ./platform_desc.txt /etc/xocl.txt
export XILINX_XRT=/usr

Then create the platform_desc.txt file in the image directory using your text editor of choice:

~/zcu104_prj/linux_components/src/image$ gedit platform_desc.txt

And add the name of the exported hardware platform from Vivado to it (without the .xsa file extension though):

zcu104_base
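
If you'd rather script these two files than create them in an editor, a heredoc does the same job. This is a minimal sketch with the same contents as shown above; the function name is made up for this example, and the platform name is passed in as an argument so it can match your own XSA file name.

```shell
# Create init.sh and platform_desc.txt in an image directory.
write_emulation_init() {
    # $1 = image directory, $2 = platform name (XSA file name without .xsa)
    cat > "$1/init.sh" <<'EOF'
cp ./platform_desc.txt /etc/xocl.txt
export XILINX_XRT=/usr
EOF
    printf '%s\n' "$2" > "$1/platform_desc.txt"
}
```

For this project the call would be write_emulation_init ~/zcu104_prj/linux_components/src/image zcu104_base.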

Finally, install the SDK generated from the PetaLinux project into the linux_components directory (make sure the PetaLinux tools are still sourced):

~/zcu104_prj/linux_components/src/image$ cd ../../../linux_os/images/linux
~/zcu104_prj/linux_os/images/linux$ ./sdk.sh -d ../../../linux_components

Create Vitis Workspace

Create a Vitis workspace directory in ~/zcu104_prj/ then launch Vitis and select that directory as the new workspace.

~/zcu104_prj$ mkdir -p vitis_workspace
~/zcu104_prj$ source /tools/Xilinx/Vitis/2021.2/settings64.sh
~/zcu104_prj$ vitis

Vitis Platform Project

Create a new Platform Project targeting the exported hardware platform from Vivado. Open the platform.spr file from the Platform Project and navigate to zcu104_base > psu_cortexa53 > linux on psu_cortexa53. Set the paths to point to the Linux components directory as outlined below (except for the QEMU Arguments and PMU QEMU Arguments, which should be left set to their defaults).

Use the drop-down next to the Bif File line to generate a bif file, then copy it to the ./linux_components/src/boot/ directory before setting its path manually.

--> Bif File                  = ~/zcu104_prj/linux_components/src/boot/linux.bif
--> Boot Components Directory = ~/zcu104_prj/linux_components/src/boot
--> Linux Rootfs              = ~/zcu104_prj/linux_components/src/image/rootfs.cpio.gz
--> Bootmode                  = SD
--> FAT32 Partition Directory = ~/zcu104_prj/linux_components/src/image
--> (Linux Image Directory)
--> Sysroot Directory         = ~/zcu104_prj/linux_components/sysroots/cortexa72-cortexa53-xilinx-linux 
--> QEMU Data                 = ~/zcu104_prj/linux_components/src/boot
--> QEMU Arguments            = /tools/Xilinx/Vitis/2021.2/data/emulation/platforms/zynqmp/sw/a53_linux/qemu/qemu_args.txt
--> PMU QEMU Arguments        = /tools/Xilinx/Vitis/2021.2/data/emulation/platforms/zynqmp/sw/a53_linux/qemu/pmu_args.txt
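
Because a wrong path here tends to show up much later as a hung emulator boot, it can be worth confirming that everything these settings point at actually exists. The sketch below checks the files and the sysroot directory named in this post under a given linux_components directory; the function name is made up for this example, and the file list is an assumption based on the steps above rather than a list Vitis itself publishes.

```shell
# Check that every file and directory referenced by the platform settings exists.
check_platform_paths() {
    # $1 = linux_components directory
    lc=$1
    status=0
    for f in src/boot/linux.bif src/boot/bl31.elf src/boot/fsbl.elf \
             src/boot/pmufw.elf src/boot/u-boot.elf src/boot/system.dtb \
             src/image/boot.scr src/image/image.ub src/image/rootfs.cpio.gz \
             src/image/init.sh src/image/platform_desc.txt; do
        [ -f "$lc/$f" ] || { echo "missing file: $lc/$f"; status=1; }
    done
    [ -d "$lc/sysroots/cortexa72-cortexa53-xilinx-linux" ] ||
        { echo "missing sysroot under $lc/sysroots"; status=1; }
    return "$status"
}
```

Running check_platform_paths ~/zcu104_prj/linux_components prints one line per missing item and returns a non-zero status if anything is absent.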

I'm not sure why I had to copy the generated linux.bif from the default location to the ./linux_components/src/boot/ directory, but the Linux boot in the emulator would hang indefinitely otherwise. It seems like QEMU is looking for all of the boot components in one place.

Once the platform paths have been set, build the platform project using Ctrl+B or the build icon from the toolbar.

Vitis Application Project

With the platform project created and built, create a new application project based on the platform project just created in the previous step.

The application settings paths will auto-populate since the paths were already specified in the platform project before building it:

I'm using the vadd accelerated application template and simply modifying it with my own code to save myself a bit of time.

Open the vadd_dds.prj file and select either Emulation-SW or Emulation-HW for the active build configuration. I recommend using HW Emulation for accelerated applications because, while SW Emulation will emulate the entire system, it doesn't fully test the hardware design that's living in the programmable logic of the device.

Since I added a DDS compiler and gave its AXI Stream ports custom names by specifying an SP_tag for them, a system.cfg file needs to be created for the Vitis linker to map those connections out properly in the accelerated kernel:

~/zcu104_prj/vitis_workspace/vadd_dds_system_hw_link$ gedit system.cfg

Then add the following:

[connectivity]
stream_connect = M_AXIS_DATA:krnl_vadd_1.dds_in
stream_connect = krnl_vadd_1.phase_out:S_AXIS_PHASE

In Binary Container Settings set V++ command line options to: --config ../system.cfg so that the Vitis linker knows where to find the system.cfg file when building/compiling the accelerated application. To access the Binary Container Settings, navigate to the Assistant window then under the drop down of vadd_dds_system > vadd_dds_system_hw_link > Emulation-HW, right-click on binary_container_1:

As you can probably tell, it is possible to place the system.cfg file wherever you'd prefer. You'll just have to specify it in the V++ command line options argument. The base directory it starts looking in is /<vitis workspace>/<application project name>_system_hw_link/Emulation-HW (since I have the active build set to HW emulation).

I find the most convenient directory is one folder up in /<vitis workspace>/<application project name>_system_hw_link/ so that both the HW and SW emulation builds can use it.
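
To illustrate why that location works for both build configurations, here is a small sketch that writes the same system.cfg shown above into the hw_link directory and then checks that ../system.cfg resolves from a given build directory. Both function names are made up for this example.

```shell
# Write the connectivity config one level above the per-configuration build
# directories so that "--config ../system.cfg" resolves from either of them.
write_system_cfg() {
    # $1 = <application project name>_system_hw_link directory
    cat > "$1/system.cfg" <<'EOF'
[connectivity]
stream_connect = M_AXIS_DATA:krnl_vadd_1.dds_in
stream_connect = krnl_vadd_1.phase_out:S_AXIS_PHASE
EOF
}

# Return success if ../system.cfg exists when seen from a build directory.
cfg_visible_from() {
    # $1 = build directory such as .../Emulation-HW or .../Emulation-SW
    [ -f "$1/../system.cfg" ]
}
```

For example, after write_system_cfg ~/zcu104_prj/vitis_workspace/vadd_dds_system_hw_link, calling cfg_visible_from on either the Emulation-HW or Emulation-SW subdirectory should succeed once those directories exist.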

After setting the V++ command line options to point to the system.cfg file, click Apply and Close.

Since this post has gotten much longer than I originally anticipated, I'm going to save the explanation of my custom C++ code for controlling the DDS compiler for my next project post. For now, find the main application code attached below in vadd.cpp and the kernel HLS C++ code attached below in krnl_vadd.cpp. You can also skip adding the custom code and just run the Vadd accelerated application as is if you're just wanting to get familiar with running the emulator.

Once the custom C++ code has been added to the application and kernel, build the project by right-clicking on vadd_dds_system in the Explorer window and selecting Build Project. This will take a bit since HLS has to also run to generate the accelerated kernel.

Note: since HLS is being called under the hood here, be sure you've applied the Y2K22 patch if you haven't already.

Launch Debugger on Emulator

Once the project has been successfully built, we can start the emulator. Select Xilinx > Start/Stop Emulator. Then click Start for the active build configuration:

The emulator always takes several minutes to boot up the QEMU Linux image for me, so I recommend a bit of patience here.

You'll see the boot progress in the Emulation console window, and once fully booted, you'll see the Linux command line prompt:

With the emulator started, we can now launch a debug session of the application on it. Right-click on vadd_dds_system in the Explorer window and select Debug As > Launch HW Emulator.

Just like with any other debug session, it will launch the application and stop at a breakpoint set right at the beginning of the main function:

You can step through your accelerated application and utilize tools such as the variables and memory monitors. Once the accelerated application has been run, any effects it might have had, such as created files, will be reflected in the emulated Linux image in the Emulation Console.

Terminate the debug run and disconnect using the red square icon and red N icon in the toolbar:

Since I had the output from my DDS compiler written to a text file, I was able to transfer it to my host machine from the emulation console using a file transfer command such as scp. Run ifconfig on the host PC after the emulator is already running to see its local IP:

scp /mnt/sd-mmcblk0p1/wave_out.txt <host user>@10.xxx.xx.x:/<desired host location>

I then simply used LibreOffice Calc (Excel) on my host machine to plot the output data from the DDS compiler to verify it:
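
Before plotting, a quick sanity check on the text file can catch an all-zero or truncated capture. The sketch below assumes one integer sample per line, which is how the attached vadd.cpp writes wave_out.txt; the function name is made up for this example.

```shell
# Print the sample count, minimum, and maximum of a one-value-per-line file.
wave_stats() {
    # $1 = path to the waveform text file (for example wave_out.txt)
    awk '{ v = $1 + 0
           if (NR == 1 || v < min) min = v
           if (NR == 1 || v > max) max = v }
         END { printf "samples=%d min=%d max=%d\n", NR, min, max }' "$1"
}
```

A complete capture from the attached application should report 1024 samples, and a cosine output should show a minimum and maximum of roughly equal magnitude and opposite sign.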

You can make any changes you would like and rebuild and relaunch a debug run of the accelerated application with the emulator still running. Once done however, you can stop the emulator the same way it was started: Select Xilinx > Start/Stop Emulator. Then click Stop.

And that's it. I plan to use this project as a reference for most of my emulations, if not this exact workspace itself, before transferring the code to my Kria projects. I've gotten questions about how to run emulation for Kria applications, and this is the best solution that I've found so far.

Stay tuned for my detailed explanation of what all is happening with the DDS code here.

/*******************************************************************************
Vendor: Xilinx
Associated Filename: vadd.cpp
Purpose: VITIS vector addition

*******************************************************************************
Copyright (C) 2019 XILINX, Inc.

This file contains confidential and proprietary information of Xilinx, Inc. and
is protected under U.S. and international copyright and other intellectual
property laws.

DISCLAIMER
This disclaimer is not a license and does not grant any rights to the materials
distributed herewith. Except as otherwise provided in a valid license issued to
you by Xilinx, and to the maximum extent permitted by applicable law:
(1) THESE MATERIALS ARE MADE AVAILABLE "AS IS" AND WITH ALL FAULTS, AND XILINX
HEREBY DISCLAIMS ALL WARRANTIES AND CONDITIONS, EXPRESS, IMPLIED, OR STATUTORY,
INCLUDING BUT NOT LIMITED TO WARRANTIES OF MERCHANTABILITY, NON-INFRINGEMENT, OR
FITNESS FOR ANY PARTICULAR PURPOSE; and (2) Xilinx shall not be liable (whether
in contract or tort, including negligence, or under any other theory of
liability) for any loss or damage of any kind or nature related to, arising under
or in connection with these materials, including for any direct, or any indirect,
special, incidental, or consequential loss or damage (including loss of data,
profits, goodwill, or any type of loss or damage suffered as a result of any
action brought by a third party) even if such damage or loss was reasonably
foreseeable or Xilinx had been advised of the possibility of the same.

CRITICAL APPLICATIONS
Xilinx products are not designed or intended to be fail-safe, or for use in any
application requiring fail-safe performance, such as life-support or safety
devices or systems, Class III medical devices, nuclear facilities, applications
related to the deployment of airbags, or any other applications that could lead
to death, personal injury, or severe property or environmental damage
(individually and collectively, "Critical Applications"). Customer assumes the
sole risk and liability of any use of Xilinx products in Critical Applications,
subject only to applicable laws and regulations governing limitations on product
liability.

THIS COPYRIGHT NOTICE AND DISCLAIMER MUST BE RETAINED AS PART OF THIS FILE AT
ALL TIMES.

*******************************************************************************/
#define OCL_CHECK(error, call)                                                                   \
    call;                                                                                        \
    if (error != CL_SUCCESS) {                                                                   \
        printf("%s:%d Error calling " #call ", error code is: %d\n", __FILE__, __LINE__, error); \
        exit(EXIT_FAILURE);                                                                      \
    }

#include <cstdint>
#include <fstream>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <string>
#include <vector>

#include "vadd.h"
#include "ap_int.h"


static const int DATA_SIZE = 4096;

static const std::string error_message =
    "Error: Result mismatch:\n"
    "i = %d CPU result = %d Device result = %d\n";

int main(int argc, char* argv[]) {

    //TARGET_DEVICE macro needs to be passed from gcc command line
    if(argc != 2) {
		std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
		return EXIT_FAILURE;
	}

    std::string xclbinFilename = argv[1];
    
    // Compute the size of array in bytes
    size_t size_in_bytes = DATA_SIZE * sizeof(int);
    
    // Creates a vector of DATA_SIZE elements with an initial value of 10 and 32
    // using customized allocator for getting buffer alignment to 4k boundary
    
    std::vector<cl::Device> devices;
    cl::Device device;
    cl_int err;
    cl::Context context;
    cl::CommandQueue q;
    cl::Kernel krnl_vector_add;
    cl::Program program;
    std::vector<cl::Platform> platforms;
    bool found_device = false;

    //traversing all Platforms To find Xilinx Platform and targeted
    //Device in Xilinx Platform
    cl::Platform::get(&platforms);
    for(size_t i = 0; (i < platforms.size() ) && (found_device == false) ;i++){
        cl::Platform platform = platforms[i];
        std::string platformName = platform.getInfo<CL_PLATFORM_NAME>();
        if ( platformName == "Xilinx"){
            devices.clear();
            platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
	    if (devices.size()){
		    device = devices[0];
		    found_device = true;
		    break;
	    }
        }
    }
    if (found_device == false){
       // No device was selected, so there is no device name to query here
       std::cout << "Error: Unable to find a Xilinx accelerator device" << std::endl;
       return EXIT_FAILURE; 
    }

    // Creating Context and Command Queue for selected device
    OCL_CHECK(err, context = cl::Context(device, NULL, NULL, NULL, &err));
    OCL_CHECK(err, q = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err));

    std::cout << "INFO: Reading " << xclbinFilename << std::endl;
    FILE* fp;
    if ((fp = fopen(xclbinFilename.c_str(), "r")) == nullptr) {
        printf("ERROR: %s xclbin not available please build\n", xclbinFilename.c_str());
        exit(EXIT_FAILURE);
    }
    // Load xclbin 
    std::cout << "Loading: '" << xclbinFilename << "'\n";
    std::ifstream bin_file(xclbinFilename, std::ifstream::binary);
    bin_file.seekg (0, bin_file.end);
    unsigned nb = bin_file.tellg();
    bin_file.seekg (0, bin_file.beg);
    char *buf = new char [nb];
    bin_file.read(buf, nb);
    
    // Creating Program from Binary File
    cl::Program::Binaries bins;
    bins.push_back({buf,nb});
    devices.resize(1);
    OCL_CHECK(err, program = cl::Program(context, devices, bins, NULL, &err));
    
    // This call will get the kernel object from program. A kernel is an 
    // OpenCL function that is executed on the FPGA. 
    OCL_CHECK(err, krnl_vector_add = cl::Kernel(program,"krnl_vadd", &err));

    // These commands will allocate memory on the Device. The cl::Buffer objects can
    // be used to reference the memory locations on the device. 
    std::cout << "Creating buffer objects for each variable..." << std::endl;
    OCL_CHECK(err, cl::Buffer buffer_a(context, CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
    OCL_CHECK(err, cl::Buffer buffer_b(context, CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
    OCL_CHECK(err, cl::Buffer buffer_result(context, CL_MEM_WRITE_ONLY, size_in_bytes, NULL, &err));

    OCL_CHECK(err, cl::Buffer buffer_phase(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
    OCL_CHECK(err, cl::Buffer buffer_waveout(context, CL_MEM_WRITE_ONLY, 1024*sizeof(int32_t), NULL, &err));

    //set the kernel Arguments
    std::cout << "Setting kernel arguments..." << std::endl;
    int narg=0;
    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_a));
    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_b));
    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_result));
    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,DATA_SIZE));

    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_phase));
    OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_waveout));

    //We then need to map our OpenCL buffers to get the pointers
    std::cout << "Mapping buffers..." << std::endl;
    int *ptr_a;
    int *ptr_b;
    int *ptr_result;
    OCL_CHECK(err, ptr_a = (int*)q.enqueueMapBuffer (buffer_a, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes, NULL, NULL, &err));
    OCL_CHECK(err, ptr_b = (int*)q.enqueueMapBuffer (buffer_b, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes, NULL, NULL, &err));
    OCL_CHECK(err, ptr_result = (int*)q.enqueueMapBuffer (buffer_result, CL_TRUE, CL_MAP_READ, 0, size_in_bytes, NULL, NULL, &err));

    uint32_t *ptr_phase;
    int32_t *ptr_waveout;
    OCL_CHECK(err, ptr_phase = (uint32_t*)q.enqueueMapBuffer (buffer_phase, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes, NULL, NULL, &err));
    OCL_CHECK(err, ptr_waveout = (int32_t*) q.enqueueMapBuffer (buffer_waveout, CL_TRUE, CL_MAP_READ, 0, 1024*sizeof(int32_t), NULL, NULL, &err));

    std::cout << "Writing 1MHz phase increment to DDS..." << std::endl;
    uint32_t phase_1MHz = 0x0051EB85;
    for(int i=0; i<DATA_SIZE; i++){
    	ptr_phase[i] = phase_1MHz;
    }

    // Data will be migrated to kernel space
    std::cout << "Migrate data to kernel space." << std::endl;
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_a,buffer_b},0/* 0 means from host*/));
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_phase},0/* 0 means from host*/));

    //Launch the Kernel
    std::cout << "Launch the kernel." << std::endl;
    OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));

    // The result of the previous kernel execution will need to be retrieved in
    // order to view the results. This call will transfer the data from FPGA to
    // source_results vector
    std::cout << "Transfer the data from FPGA to source results vector." << std::endl;
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_result},CL_MIGRATE_MEM_OBJECT_HOST));
    OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_waveout},CL_MIGRATE_MEM_OBJECT_HOST));
    // Enqueued commands are asynchronous, so wait for the kernel and the
    // migrations to complete before reading the mapped buffers on the host
    OCL_CHECK(err, err = q.finish());

    std::cout << "Verify the result." << std::endl;
    //Verify the result
    int match = 0;
    for (int i = 0; i < DATA_SIZE; i++) {
        int host_result = ptr_a[i] + ptr_b[i];
        std::cout << "ptr_result value = " << ptr_result[i] << std::endl;
        std::cout << "host_result value = " << host_result << std::endl;
        if (ptr_result[i] != host_result) {
            printf(error_message.c_str(), i, host_result, ptr_result[i]);
            match = 1;
            break;
        }
    }

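    std::cout << "Opening text file to write output waveform to..." << std::endl;
    FILE *fp_wave;
    fp_wave=fopen("wave_out.txt","w");
    if (fp_wave == nullptr) {
        printf("ERROR: unable to open wave_out.txt for writing\n");
        exit(EXIT_FAILURE);
    }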

    std::cout << "Writing the output waveform from the results buffer to the text file..." << std::endl;
    for (int i = 0; i < 1024; i++) {
    	fprintf(fp_wave,"%i\n",ptr_waveout[i]);
    }

    std::cout << "Closing text file..." << std::endl;
    fclose(fp_wave);

    std::cout << "Unmapping buffers..." << std::endl;
    OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_a , ptr_a));
    OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_b , ptr_b));
    OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_result , ptr_result));

    OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_phase, ptr_phase));
    OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_waveout, ptr_waveout));

    std::cout << "Final error check..." << std::endl;
    OCL_CHECK(err, err = q.finish());

    std::cout << "TEST " << (match ? "FAILED" : "PASSED") << std::endl; 
    return (match ? EXIT_FAILURE :  EXIT_SUCCESS);

}

/**
* Copyright (C) 2019-2021 Xilinx, Inc
*
* Licensed under the Apache License, Version 2.0 (the "License"). You may
* not use this file except in compliance with the License. A copy of the
* License is located at
*
*     http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
*/

/*******************************************************************************
Description:

    This example uses the load/compute/store coding style which is generally
    the most efficient for implementing kernels using HLS. The load and store
    functions are responsible for moving data in and out of the kernel as
    efficiently as possible. The core functionality is decomposed across one
    or more compute functions. Whenever possible, the compute function should
    pass data through HLS streams and should contain a single set of nested loops.

    HLS stream objects are used to pass data between producer and consumer
    functions. Stream read and write operations have a blocking behavior which
    allows consumers and producers to synchronize with each other automatically.

    The dataflow pragma instructs the compiler to enable task-level pipelining.
    This is required for the load/compute/store functions to execute in a parallel
    and pipelined manner. Here the kernel loads, computes and stores NUM_WORDS integer values per
    clock cycle and is implemented as below:
                                       _____________
                                      |             |<----- Input Vector 1 from Global Memory
                                      |  load_input |       __
                                      |_____________|----->|  |
                                       _____________       |  | in1_stream
Input Vector 2 from Global Memory --->|             |      |__|
                               __     |  load_input |        |
                              |  |<---|_____________|        |
                   in2_stream |  |     _____________         |
                              |__|--->|             |<--------
                                      | compute_add |      __
                                      |_____________|---->|  |
                                       ______________     |  | out_stream
                                      |              |<---|__|
                                      | store_result |
                                      |______________|-----> Output result to Global Memory

*******************************************************************************/

// Includes
#include <fstream>
#include <stdio.h>
#include <stdlib.h>
#include <iostream>
#include <stdint.h>
#include <hls_stream.h>
#include "ap_int.h"
#include "ap_axi_sdata.h"

#define BUFFER_SIZE 256
#define DATA_SIZE 4096
#define WAVE_SIZE 1024

// TRIPCOUNT identifier
const int c_size = DATA_SIZE;

static void load_input(uint32_t* in, hls::stream<uint32_t>& inStream, int size) {
mem_rd:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        inStream << in[i];
    }
}

static void compute_add(hls::stream<uint32_t>& in1_stream, hls::stream<uint32_t>& in2_stream, hls::stream<uint32_t>& out_stream, int size) {
// Each iteration reads one 32-bit word from each input stream and writes
// their sum to the output stream.
execute:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out_stream << (in1_stream.read() + in2_stream.read());
    }
}

static void store_result(uint32_t* out, hls::stream<uint32_t>& out_stream, int size) {
mem_wr:
    for (int i = 0; i < size; i++) {
#pragma HLS LOOP_TRIPCOUNT min = c_size max = c_size
        out[i] = out_stream.read();
    }
}

typedef ap_axis<16, 0, 0, 0> data_pkt;
typedef ap_axis<31, 0, 0, 0> phase_pkt;

extern "C" {

/*
    Vector Addition Kernel

    Arguments:
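        in1       (input)  --> Input vector 1
        in2       (input)  --> Input vector 2
        out       (output) --> Output vector
        size      (input)  --> Number of elements in vector
        phase     (input)  --> Phase values written to the phase_out stream
        wave_out  (output) --> DDS output waveform
        dds_in    (input)  --> DDS input waveform from platform DDS Compiler IP
        phase_out (output) --> Phase packets streamed out of the kernel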
*/

void krnl_vadd(uint32_t* in1, uint32_t* in2, uint32_t* out, int size,
	       uint32_t *phase, int32_t *wave_out,
	       hls::stream<data_pkt> &dds_in, hls::stream<phase_pkt> &phase_out) {
#pragma HLS INTERFACE m_axi port = in1 bundle = gmem0
#pragma HLS INTERFACE m_axi port = in2 bundle = gmem1
#pragma HLS INTERFACE m_axi port = out bundle = gmem0

    static hls::stream<uint32_t> in1_stream("input_stream_1");
    static hls::stream<uint32_t> in2_stream("input_stream_2");
    static hls::stream<uint32_t> out_stream("output_stream");

#pragma HLS dataflow
    // The dataflow pragma instructs the compiler to run the following functions in parallel
    load_input(in1, in1_stream, size);
    load_input(in2, in2_stream, size);
    compute_add(in1_stream, in2_stream, out_stream, size);
    store_result(out, out_stream, size);

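    // Stream the phase values out of the kernel, one packet per iteration
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
        phase_pkt val;
        val.data = phase[i];
        phase_out.write(val);
    }

    // Capture WAVE_SIZE samples of the DDS waveform into memory
    for (int i = 0; i < WAVE_SIZE; i++) {
#pragma HLS PIPELINE II = 1
        data_pkt value = dds_in.read();
        wave_out[i] = value.data;
    }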

}
}
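
The load/compute/store structure described in the kernel's header comment can also be checked on a plain host compiler, without the HLS headers. The following is only a minimal sketch under stated assumptions: `SwStream` is a hypothetical stand-in built on `std::queue` (it is not `hls::stream`), the `sw_` function names are made up for illustration, and the three stages run one after another instead of concurrently as they would under the dataflow pragma. It only models the vector add, not the DDS stream ports.

```cpp
#include <cassert>
#include <cstdint>
#include <queue>
#include <vector>

// Hypothetical stand-in for hls::stream<uint32_t>: a plain software FIFO.
using SwStream = std::queue<uint32_t>;

// Load stage: copy `size` words from memory into the stream.
static void sw_load_input(const uint32_t* in, SwStream& stream, int size) {
    for (int i = 0; i < size; i++) {
        stream.push(in[i]);
    }
}

// Compute stage: pop one word from each input stream and push their sum.
static void sw_compute_add(SwStream& in1, SwStream& in2, SwStream& out, int size) {
    for (int i = 0; i < size; i++) {
        // A hardware stream read would block here; this sequential model
        // asserts instead, because the load stage has already finished.
        assert(!in1.empty() && !in2.empty());
        uint32_t a = in1.front(); in1.pop();
        uint32_t b = in2.front(); in2.pop();
        out.push(a + b);  // unsigned addition wraps modulo 2^32
    }
}

// Store stage: drain `size` words from the stream back into memory.
static void sw_store_result(uint32_t* out, SwStream& stream, int size) {
    for (int i = 0; i < size; i++) {
        assert(!stream.empty());
        out[i] = stream.front();
        stream.pop();
    }
}

// Sequential model of the vector-add part of the kernel.
// Uses the shorter of the two inputs as the element count.
std::vector<uint32_t> sw_vadd(const std::vector<uint32_t>& in1,
                              const std::vector<uint32_t>& in2) {
    const int size = static_cast<int>(in1.size() < in2.size() ? in1.size() : in2.size());
    SwStream in1_stream, in2_stream, out_stream;
    std::vector<uint32_t> out(size);

    sw_load_input(in1.data(), in1_stream, size);
    sw_load_input(in2.data(), in2_stream, size);
    sw_compute_add(in1_stream, in2_stream, out_stream, size);
    sw_store_result(out.data(), out_stream, size);
    return out;
}
```

Calling `sw_vadd` with two equal-length vectors returns their element-wise sum, which gives a reference result to compare against the output buffer read back from the emulated kernel.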

Credits

Whitney Knitter

All thoughts/opinions are my own and do not reflect those of any company/entity I currently/previously associate with.
