Direct Digital Synthesizers (DDS) are a key tool in digital communication systems as they provide a way to generate a complex sinusoid in the digital domain. They're something of a Swiss Army knife for me as a software-defined radio (SDR) designer, so I've found it worthwhile to explore the different ways to implement them.
So far, I've done a few write-ups on implementing the Xilinx DDS Compiler IP block in programmable logic, both in the block design and via direct instantiation in custom HDL code. I also did a write-up on how to control a DDS compiler in the block design from a bare-metal C application running on the ARM core of a Zynq, where I highlighted some of the key differences in controlling the DDS compiler from the PL versus the application space in C. Namely, this came down to how tightly one can control the timing of the phase value input of the DDS compiler.
In this project, I'll be walking through how to control a DDS compiler from an accelerated application running in Linux on the FPGA. Technically, this would be acceleration of the DDS compiler since I'm offloading its functionality to the PL and streaming back and forth to it via mapped memory space with the accelerated kernel.
If you've read my previous write-up, you're probably wondering how this is any different from the bare-metal C application that wrote phase input data to memory (DDR) via an AXI DMA controller. The TL;DR is that the accelerated kernel bypasses some of the latency introduced by the AXI DMA controller by directly writing/reading to/from its ports mapped to those memory locations. Acceleration like this is a more integrated approach that offers better throughput to meet higher data bandwidth requirements.
DDS Compiler in Vivado Block Design

I personally use the following configuration of the DDS compiler the most: streaming in a 32-bit phase increment value (not using the phase offset option for the moment, but it's easy to add) and outputting just a 16-bit cosine waveform (no phase output for the moment either, but adding it would also be fairly straightforward when needed).
Since the phase input and data output ports of the DDS compiler are going to be mapped to the same memory locations as the kernel ports so the kernel can access them, they are left unconnected in the block design in Vivado. Only the clock port is connected in the block design (the same clock that is designated as the default clock in the Platform Setup tab).
Speaking of the Platform Setup tab, the AXI Stream ports from the DDS compiler will appear under the AXI Stream Port tab. I added SP Tag values for each to make linking them to the kernel easier/clearer later on. Setting custom SP Tag values for each port is a necessity if any other AXI Stream ports are added to the block design, as the SP Tags are the reference designators the V++ compiler in Vitis uses to map the ports to specific memory locations when compiling the accelerated application.
Once the DDS compiler IP has been added to the block design and configured as desired, validate & save the block design, then follow the normal flow to generate a bitstream for the design, including disabling incremental synthesis, generating the block design, and running synthesis & implementation (see the project flow for this design in my previous write-up here).
Accelerated Kernel for Writing Phase & Reading Data

When I'm developing an accelerated application and get to the stage of finally writing some C++ code, I like to focus first on the kernel and how I'm going to get the data in and out of memory. The kernel in an accelerated application is technically an HLS module. For those not familiar, HLS refers to Xilinx's high-level synthesis compiler that takes C/C++ code and compiles it into RTL that is instantiated/run in the programmable logic of the FPGA. So the accelerated kernel itself is actually RTL running in the PL that the C/C++ code of the application is communicating with.
Note: C is pretty much worthless in accelerated applications, as I've found from personal experience (as well as in most HLS projects); it's too low level and missing key functionality compared to C++, predominantly in the way memory and pointers are handled. I highly recommend brushing up on your C++ if you're normally a C person like I was, or your life will be very difficult…
Since we're dealing with AXI stream interfaces, start by adding the header files that support the AXI stream data type and the HLS stream data type:
#include "ap_int.h"
#include "ap_axi_sdata.h"
#include "hls_stream.h"
Then define AXI stream packet types with the respective data widths, one for the data stream output from the DDS compiler and another for the phase increment stream to the input of the DDS compiler, right before the kernel's main function definition:
typedef ap_axis<16, 0, 0, 0> data_pkt;  // 16-bit cosine samples from the DDS compiler
typedef ap_axis<32, 0, 0, 0> phase_pkt; // 32-bit phase increment words to the DDS compiler
Next, add the AXI stream packets and the data pointers used to load individual elements into the AXI streams as arguments to the kernel's main function (krnl_vadd in the case of the Vadd application template I'm using here):
void krnl_vadd(uint32_t* in1, uint32_t* in2, uint32_t* out, int size,
uint32_t *phase, int32_t *wave_out,
hls::stream<data_pkt> &dds_in, hls::stream<phase_pkt> &phase_out)
Note that the phase variable is specifically constrained to an unsigned 32-bit integer, and the wave_out variable to a signed 32-bit integer (versus simply declaring them as integers with int). This is more important than usual since this C++ code is being compiled into RTL, and not being very specific/explicit with your data types can cause the compiled RTL's behavior to come out very differently than you intended. You have to make sure your digital hardware thinking hat stays on here, because this is an angry crocodile pit for you software engineers.
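As a quick, purely hypothetical illustration (none of these helpers are part of the actual kernel): the data field of the 16-bit AXI stream packet is a signed ap_int<16>, so whether a DDS sample sign-extends or zero-extends when it gets widened into a 32-bit word depends entirely on the declared types.

// Hypothetical helpers, just to show why signedness matters once HLS turns this
// into hardware: a signed ap_int<16> sign-extends when widened to 32 bits, while
// an unsigned ap_uint<16> zero-extends and turns negative DDS samples into large
// positive numbers.
#include "ap_int.h"
#include <stdint.h>

int32_t widen_signed(ap_int<16> sample)    { return (int32_t)sample; } // -1 stays -1
int32_t widen_unsigned(ap_uint<16> sample) { return (int32_t)sample; } // 0xFFFF becomes 65535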
Both the phase increment data and the waveform data are simply being written to/read from DDR with no processing, so add those write/read processes to the body of krnl_vadd():
// Stream the phase increment words out to the DDS compiler's input
for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
    phase_pkt val;
    val.data = phase[i];
    phase_out.write(val);
}

// Read the resulting waveform samples back from the DDS compiler's output
for (int i = 0; i < WAVE_SIZE; i++) {
#pragma HLS PIPELINE II = 1
    data_pkt value = dds_in.read();
    wave_out[i] = value.data;
}
And that’s it for the accelerated kernel. As I mentioned, since the data is simply being written to/read from the DDR with no processing, it makes the code pretty simple here. My main goal was to highlight the read/write process to the DDR since all accelerated kernels will need that functionality.
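For reference, here's roughly what the whole kernel looks like once these pieces are dropped into the Vadd template. Treat it as a sketch: the extern "C" wrapper and the vector-add loop are the standard template boilerplate, and I'm assuming WAVE_SIZE is defined as the 1024 samples the host reads back, so your copy may differ slightly.

#include "ap_int.h"
#include "ap_axi_sdata.h"
#include "hls_stream.h"
#include <stdint.h>

#define WAVE_SIZE 1024 // assumed: number of DDS samples to read back (matches the host buffer)

typedef ap_axis<16, 0, 0, 0> data_pkt;  // 16-bit cosine samples from the DDS compiler
typedef ap_axis<32, 0, 0, 0> phase_pkt; // 32-bit phase increment words to the DDS compiler

extern "C" {
void krnl_vadd(uint32_t* in1, uint32_t* in2, uint32_t* out, int size,
               uint32_t *phase, int32_t *wave_out,
               hls::stream<data_pkt> &dds_in, hls::stream<phase_pkt> &phase_out) {

    // Original Vadd template functionality
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
        out[i] = in1[i] + in2[i];
    }

    // Stream the phase increment words out to the DDS compiler's input
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
        phase_pkt val;
        val.data = phase[i];
        phase_out.write(val);
    }

    // Read the resulting waveform samples back from the DDS compiler's output
    for (int i = 0; i < WAVE_SIZE; i++) {
#pragma HLS PIPELINE II = 1
        data_pkt value = dds_in.read();
        wave_out[i] = value.data;
    }
}
}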
As I mentioned previously, the SP Tag is how the V++ compiler in Vitis identifies AXI ports and maps them to the appropriate arguments in the kernel's main function. The SP Tags for the DDS compiler are defined in the Vivado block design and then mapped to a kernel function argument via a system.cfg file located somewhere in the Vitis workspace directory. I personally like to place it in <vitis workspace>/<application name>_system_hw_link/ as this is the common directory any emulation or hardware build of the application project in Vitis will use. I simply run my preferred text editor from the command line to generate the system.cfg file and open it for editing:
~/<vitis workspace>/<application name>_system_hw_link$ gedit system.cfg
The first line of system.cfg is [connectivity], followed by the appropriate AXI stream connections:
[connectivity]
stream_connect = M_AXIS_DATA:krnl_vadd_1.dds_in
stream_connect = krnl_vadd_1.phase_out:S_AXIS_PHASE
As you can see, this is where the master AXI stream port of the DDS compiler that's streaming out waveform data is connected to the kernel's dds_in stream, and the slave AXI stream port of the DDS compiler is connected to the kernel's phase_out stream.
The location of the system.cfg file is made known to the V++ compiler in the Binary Container Settings: in the Assistant window, under the drop-down vadd_dds_system > vadd_dds_system_hw_link > Emulation-HW, right-click on binary_container_1:
Then specify the system.cfg file and its location relative to the active build configuration directory in the V++ command line options argument. For reference, the possible build configuration directories are:
- <vitis workspace>/<application name>_system_hw_link/Emulation-SW
- <vitis workspace>/<application name>_system_hw_link/Emulation-HW
- <vitis workspace>/<application name>_system_hw_link/Hardware
This kind of shows why I like placing system.cfg in <vitis workspace>/<application name>_system_hw_link/: the location to specify in the V++ command line options for each build configuration is then simply ../
Thus, my V++ command line options entry for all build configurations is:
--config ../system.cfg
Once I've completed the kernel and I know how the data is coming in/out of memory, I move on to writing the accelerated application itself.
First things first, buffer objects need to be created for the phase and waveform data variables that the kernel is reading/writing in DDR:
OCL_CHECK(err, cl::Buffer buffer_phase(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
OCL_CHECK(err, cl::Buffer buffer_waveout(context, CL_MEM_WRITE_ONLY, 1024*sizeof(int32_t), NULL, &err));
This is followed by setting which argument number of the kernel's main function (krnl_vadd() in this case) each buffer object relates to, with the argument numbers being zero-indexed. For example, since the first argument is the A[i] input of the A[i] + B[i] = C[i] Vadd template code, it is argument number zero.
Since the phase buffer object in the accelerated application connects to the unsigned 32-bit integer argument called phase in the kernel, it's argument number 4; likewise, the waveform output buffer object is argument number 5. The Vadd template's way of handling this is pretty slick in my opinion with the incrementing narg variable, so all you have to do is make sure they're in the right order:
int narg=0;
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_a));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_b));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_result));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,DATA_SIZE));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_phase));   // argument 4: phase
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_waveout)); // argument 5: wave_out
Next, the buffer objects need to be mapped to local memory pointers in the application so we can then finally write some “normal” software:
uint32_t *ptr_phase;
int32_t *ptr_waveout;
OCL_CHECK(err, ptr_phase = (uint32_t*)q.enqueueMapBuffer (buffer_phase, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes, NULL, NULL, &err));
OCL_CHECK(err, ptr_waveout = (int32_t*) q.enqueueMapBuffer (buffer_waveout, CL_TRUE, CL_MAP_READ, 0, 1024*sizeof(int32_t), NULL, NULL, &err));
Take note that my local pointers, ptr_phase and ptr_waveout, have the same data types as their respective variables in the kernel's main function, phase and wave_out. Again, this is a key detail that can cause your application to compile with very different functionality than you're expecting if they aren't matched up.
At this point, the local pointers can be treated like normal pointers for your accelerated application’s functionality. So before I launch the kernel, I’m going to load the phase increment data I want written out to the DDS compiler’s input from the accelerated kernel:
uint32_t phase_1MHz = 0x0051EB85; // phase increment word for a 1 MHz output

for (int i = 0; i < DATA_SIZE; i++) {
    ptr_phase[i] = phase_1MHz;
}
Based on the DDS compiler's settings in the Vivado block design, I'm loading the phase pointer so that it outputs a 1 MHz cosine wave for a period of 4096 samples.
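For anyone wondering where a value like that comes from, the phase increment word for a DDS is just the desired output frequency scaled by the size of the phase accumulator: increment = round((f_out / f_clk) * 2^B), where B is the accumulator width. Here's a minimal sketch of that calculation; the 100 MHz clock and 32-bit accumulator in the example are placeholder values for illustration, not necessarily what my block design is configured for.

#include <cstdint>
#include <cmath>

// Compute the phase increment word for a DDS with a phase_width-bit accumulator:
// increment = round((f_out / f_clk) * 2^phase_width)
uint32_t phase_increment(double f_out_hz, double f_clk_hz, unsigned phase_width)
{
    double inc = (f_out_hz / f_clk_hz) * std::pow(2.0, (double)phase_width);
    return (uint32_t)std::llround(inc);
}

// Example (placeholder numbers): phase_increment(1.0e6, 100.0e6, 32) returns
// 0x028F5C29 for a 1 MHz output from a DDS clocked at 100 MHz with a 32-bit accumulator.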
With the phase increment data loaded, I let the default code in the template migrate the data to the kernel, launch it, then transfer the received data back from the kernel. This ultimately results in the local pointer, ptr_waveout, now pointing to the waveform data output from the DDS compiler. To verify the data is in fact representative of a 1 MHz cosine wave, I simply wrote the contents of the pointer to a text file to store locally in the embedded Linux image:
// Dump the first 1024 waveform samples to a text file for offline plotting
FILE *fp_wave = fopen("wave_out.txt", "w");
for (int i = 0; i < 1024; i++) {
    fprintf(fp_wave, "%i\n", ptr_waveout[i]);
}
fclose(fp_wave);
Note that the format specifier used here when writing to the text file is for unsigned data even though the samples are signed. I honestly can't tell you what happened here, but something about using the signed data format specifier was converting the data so that the cosine waveform didn't plot correctly.
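For reference, the data migration and kernel launch that I'm letting the template code handle look roughly like this. It's a sketch based on the standard Vadd host code, so the exact calls, buffer lists, and event handling in your copy of the template may differ.

// Send the input buffers (including the phase increment data) to the device
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_phase}, 0 /* 0 = host to device */));

// Launch the kernel
OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));

// Bring the output buffers (including the DDS waveform) back to the host
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_waveout}, CL_MIGRATE_MEM_OBJECT_HOST));

// Wait for everything to finish before touching the mapped pointers
q.finish();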
Then finally, to clean up after the accelerated application has been run, unmap the two buffers from the DDR:
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_phase, ptr_phase));
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_waveout, ptr_waveout));
For reference, here’s a diagram of the data flow between the PL, kernel, and accelerated application:
I then transferred the text file to my host PC (I ran my application on the hardware emulator in Vitis as outlined in my previous post, but you can use this same process running on actual hardware):
root@linux_os: ~# scp /mnt/sd-mmcblk0p1/wave_out.txt <host user>@xxx.xxx.xx.x:/<desired host location>
From there, I threw together a quick little Python script (attached below) to plot and display the data to verify that it was indeed a cosine wave. You could also copy and paste the data into an Excel file and plot it that way.
And we have a cosine!
So remember when I warned about an angry crocodile pit for you software engineers? Well, I also found one for us digital hardware folks…
I initially had the data width of the AXI stream packet in the kernel set to match the Vivado configuration at what I THOUGHT was 16 bits wide:
typedef ap_axis<15, 0, 0, 0> data_pkt;
However, when I went to plot my cosine wave I noticed that the data values were overflowing at +/- 16384:
The DDS compiler is configured in Vivado to output in 2's complement, so between the sign bit and whatever black magic was happening in the Vitis compilation of the HLS kernel, I was losing a bit's worth of data off the MSB. With 16 bits of 2's complement data I should be able to represent values from -32768 to +32767, but the data was wrapping around at +/- 16384, which is what you get with only 15 bits.
Well, after some digging through the Vitis user guide and Xilinx forums, I discovered that the data width in the AXI stream packet declaration IS NOT ZERO INDEXED. So when I thought I was creating a 16-bit wide packet with typedef ap_axis<15, 0, 0, 0>, I was actually only creating a 15-bit wide packet. And typedef ap_axis<16, 0, 0, 0> does not create a 17-bit wide AXI stream like I thought, but instead the actual 16 bits I needed.
typedef ap_axis<16, 0, 0, 0> data_pkt;
So fixing my AXI stream packet to actually be 16 bits wide gave me the pretty cosine waveform I had originally been expecting:
This discovery had about the same impact on me as feeding a Mogwai after midnight, and had me at full Gremlin-level anger. But hey, I definitely won't forget that the ap_axis declarations aren't zero-indexed in my next project!