Direct Digital Synthesizers (DDS) are a key tool in digital communication systems as they provide a way to generate a complex sinusoid in the digital domain. They're something of a Swiss Army knife for me as a software-defined radio (SDR) designer, so I've found it worthwhile to explore the different ways to implement them.
So far, I've done a few write-ups on implementing the Xilinx DDS Compiler IP block in programmable logic, both in the block design and via direct instantiation in custom HDL code. I also did a write-up on how to control a DDS compiler in the block design from a bare-metal C application running on the ARM core of a Zynq, where I highlighted some of the key differences in controlling the DDS compiler from the PL versus the application space in C. Namely, this came down to how tightly one can control the timing of the phase value input of the DDS compiler.
In this project, I'll be walking through how to control a DDS compiler from an accelerated application running in Linux on the FPGA. Technically, this would be acceleration of the DDS compiler since I'm offloading its functionality to the PL and streaming back and forth to it via mapped memory space with the accelerated kernel.
If you've read my previous write-up, you're probably wondering how this is any different from the bare-metal C application that wrote phase input data to memory (DDR) via an AXI DMA controller. The TL;DR is that the accelerated kernel bypasses some of the latency introduced by the AXI DMA controller by directly writing/reading to/from its ports mapped to those memory locations. Acceleration like this is a more integrated approach that offers better throughput to meet higher data bandwidth requirements.
DDS Compiler in Vivado Block Design

I personally use the following configuration of the DDS compiler the most: streaming in a 32-bit phase increment value (not using the phase offset option for the moment, but it's easy to add) and outputting just a 16-bit cosine waveform (no phase output for the moment either, but adding it would also be fairly straightforward when needed).
Since the phase input and data output ports of the DDS compiler are going to be mapped to the same memory locations as the kernel ports so the kernel can access them, they are left unconnected in the block design in Vivado. Only the clock port is connected in the block design (the same clock that is designated as the default clock in the Platform Setup tab).
Speaking of the Platform Setup tab, the AXI Stream ports from the DDS compiler will appear under the AXI Stream Port tab. I added SP Tag values for each to make linking them to the kernel easier/clearer later on. Setting custom SP Tag values for each port is a necessity if any other AXI Stream ports are added to the block design, as the SP Tags are the reference designators the V++ compiler in Vitis uses to map the ports to specific memory locations when compiling the accelerated application.
Once the DDS compiler IP has been added to the block design and configured as desired, validate & save the block design, then follow the normal flow to generate a bitstream for the design, including disabling incremental synthesis, generating the block design, and running synthesis & implementation (see the project flow for this design in my previous write-up here).
Accelerated Kernel for Writing Phase & Reading Data

When I'm developing an accelerated application and get to the stage of finally writing some C++ code, I like to focus first on the kernel and how I'm going to get the data in and out of memory. The kernel in an accelerated application is technically an HLS module. For those not familiar, HLS refers to Xilinx's high-level synthesis compiler that takes C/C++ code and compiles it into RTL that is instantiated/run in the programmable logic of the FPGA. So the accelerated kernel itself is actually RTL running in the PL that the C/C++ code of the application is communicating with.
Note: C is pretty much worthless in accelerated applications, as I've found from personal experience (as well as in most HLS projects); it's too low level and missing key functionality compared to C++, predominantly in the way memory and pointers are handled. I highly recommend brushing up on your C++ if you're normally a C person like I was, or your life will be very difficult…
Since we're dealing with AXI stream interfaces, start by adding the header files that support the AXI stream data type and the HLS stream data type:
#include "ap_int.h"
#include "ap_axi_sdata.h"
#include "hls_stream.h"
Then define AXI stream packet types with the respective data widths, one for the data stream output from the DDS compiler and another for the phase increment stream to the input of the DDS compiler, right before the kernel's main function definition:
typedef ap_axis<16, 0, 0, 0> data_pkt;  // 16-bit cosine samples from the DDS compiler
typedef ap_axis<32, 0, 0, 0> phase_pkt; // 32-bit phase increment words to the DDS compiler
Next, add the AXI stream packets and the data pointers used to load individual elements into the AXI streams as arguments to the kernel's main function (krnl_vadd in the case of the Vadd application template I'm using here):
void krnl_vadd(uint32_t* in1, uint32_t* in2, uint32_t* out, int size,
uint32_t *phase, int32_t *wave_out,
hls::stream<data_pkt> &dds_in, hls::stream<phase_pkt> &phase_out)
Note that the phase variable is specifically constrained to an unsigned 32-bit integer, and the wave_out variable to a signed 32-bit integer (versus simply declaring them as integers with int). This is more important than usual since this C++ code is being compiled into RTL, and not being very specific/explicit with your data types can cause the compiled RTL's behavior to come out very differently than you intended. You have to make sure your digital hardware thinking hat stays on here, because this is an angry crocodile pit for you software engineers.
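As a quick, purely hypothetical illustration (none of these helpers are part of the actual kernel): the data field of the 16-bit AXI stream packet is a signed ap_int<16>, so whether a DDS sample sign-extends or zero-extends when it gets widened into a 32-bit word depends entirely on the declared types.

// Hypothetical helpers, just to show why signedness matters once HLS turns this
// into hardware: a signed ap_int<16> sign-extends when widened to 32 bits, while
// an unsigned ap_uint<16> zero-extends and turns negative DDS samples into large
// positive numbers.
#include "ap_int.h"
#include <stdint.h>

int32_t widen_signed(ap_int<16> sample)    { return (int32_t)sample; } // -1 stays -1
int32_t widen_unsigned(ap_uint<16> sample) { return (int32_t)sample; } // 0xFFFF becomes 65535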
Both the phase increment data and the waveform data are simply being written to/read from DDR with no processing, so add those write/read processes to the body of krnl_vadd():
// Stream the phase increment words out to the DDS compiler's input
for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
    phase_pkt val;
    val.data = phase[i];
    phase_out.write(val);
}

// Read the resulting waveform samples back from the DDS compiler's output
for (int i = 0; i < WAVE_SIZE; i++) {
#pragma HLS PIPELINE II = 1
    data_pkt value = dds_in.read();
    wave_out[i] = value.data;
}
And that’s it for the accelerated kernel. As I mentioned, since the data is simply being written to/read from the DDR with no processing, it makes the code pretty simple here. My main goal was to highlight the read/write process to the DDR since all accelerated kernels will need that functionality.
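For reference, here's roughly what the whole kernel looks like once these pieces are dropped into the Vadd template. Treat it as a sketch: the extern "C" wrapper and the vector-add loop are the standard template boilerplate, and I'm assuming WAVE_SIZE is defined as the 1024 samples the host reads back, so your copy may differ slightly.

#include "ap_int.h"
#include "ap_axi_sdata.h"
#include "hls_stream.h"
#include <stdint.h>

#define WAVE_SIZE 1024 // assumed: number of DDS samples to read back (matches the host buffer)

typedef ap_axis<16, 0, 0, 0> data_pkt;  // 16-bit cosine samples from the DDS compiler
typedef ap_axis<32, 0, 0, 0> phase_pkt; // 32-bit phase increment words to the DDS compiler

extern "C" {
void krnl_vadd(uint32_t* in1, uint32_t* in2, uint32_t* out, int size,
               uint32_t *phase, int32_t *wave_out,
               hls::stream<data_pkt> &dds_in, hls::stream<phase_pkt> &phase_out) {

    // Original Vadd template functionality
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
        out[i] = in1[i] + in2[i];
    }

    // Stream the phase increment words out to the DDS compiler's input
    for (int i = 0; i < size; i++) {
#pragma HLS PIPELINE II = 1
        phase_pkt val;
        val.data = phase[i];
        phase_out.write(val);
    }

    // Read the resulting waveform samples back from the DDS compiler's output
    for (int i = 0; i < WAVE_SIZE; i++) {
#pragma HLS PIPELINE II = 1
        data_pkt value = dds_in.read();
        wave_out[i] = value.data;
    }
}
}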
As I mentioned previously, the SP Tag is how the V++ compiler in Vitis identifies AXI ports and maps them to the appropriate arguments in the kernel's main function. The SP Tags for the DDS compiler are defined in the Vivado block design and then mapped to a kernel function argument via a system.cfg file located somewhere in the Vitis workspace directory. I personally like to place it in <vitis workspace>/<application name>_system_hw_link/ as this is the common directory any emulation or hardware build of the application project in Vitis will use. I simply run my preferred text editor from the command line to generate the system.cfg file and open it for editing:
~/<vitis workspace>/<application name>_system_hw_link$ gedit system.cfg
The first line of system.cfg is [connectivity], followed by the appropriate AXI stream connections:
[connectivity]
stream_connect = M_AXIS_DATA:krnl_vadd_1.dds_in
stream_connect = krnl_vadd_1.phase_out:S_AXIS_PHASE
As you can see, this is where the master AXI stream port of the DDS compiler that's streaming out waveform data is connected to the kernel's dds_in stream, and the slave AXI stream port of the DDS compiler is connected to the kernel's phase_out stream.
The location of the system.cfg file is made known to the V++ compiler in the Binary Container Settings: in the Assistant window, under the drop-down vadd_dds_system > vadd_dds_system_hw_link > Emulation-HW, right-click on binary_container_1:
Then specify the system.cfg file and its location relative to the active build configuration directory in the V++ command line options argument. For reference, the possible build configuration directories are:
- <vitis workspace>/<application name>_system_hw_link/Emulation-SW
- <vitis workspace>/<application name>_system_hw_link/Emulation-HW
- <vitis workspace>/<application name>_system_hw_link/Hardware
This kind of shows why I like placing system.cfg in <vitis workspace>/<application name>_system_hw_link/: the location to specify in the V++ command line options for each build configuration is then simply ../
Thus, my V++ command line options entry for all build configurations is:
--config ../system.cfg
Once I've completed the kernel and I know how the data is coming in/out of memory, I move on to writing the accelerated application itself.
First things first, buffer objects need to be created for the phase and waveform data variables that the kernel is reading/writing in DDR:
OCL_CHECK(err, cl::Buffer buffer_phase(context, CL_MEM_ALLOC_HOST_PTR | CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
OCL_CHECK(err, cl::Buffer buffer_waveout(context, CL_MEM_WRITE_ONLY, 1024*sizeof(int32_t), NULL, &err));
This is followed by setting which argument number of the kernel's main function (krnl_vadd() in this case) each buffer object relates to, with the argument numbers being zero-indexed. For example, since the first argument is the A[i] input of the A[i] + B[i] = C[i] Vadd template code, it is argument number zero.
Since the phase buffer object in the accelerated application connects to the unsigned 32-bit integer argument called phase in the kernel, it's argument number 4; likewise, the waveform output buffer object is argument number 5. The Vadd template's way of handling this is pretty slick in my opinion with the incrementing narg variable, so all you have to do is make sure they're in the right order:
int narg=0;
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_a));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_b));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_result));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,DATA_SIZE));
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_phase));   // argument 4: phase
OCL_CHECK(err, err = krnl_vector_add.setArg(narg++,buffer_waveout)); // argument 5: wave_out
Next, the buffer objects need to be mapped to local memory pointers in the application so we can then finally write some “normal” software:
uint32_t *ptr_phase;
int32_t *ptr_waveout;
OCL_CHECK(err, ptr_phase = (uint32_t*)q.enqueueMapBuffer (buffer_phase, CL_TRUE, CL_MAP_WRITE, 0, size_in_bytes, NULL, NULL, &err));
OCL_CHECK(err, ptr_waveout = (int32_t*) q.enqueueMapBuffer (buffer_waveout, CL_TRUE, CL_MAP_READ, 0, 1024*sizeof(int32_t), NULL, NULL, &err));
Take note that my local pointers, ptr_phase and ptr_waveout, have the same data types as their respective variables in the kernel's main function, phase and wave_out. Again, this is a key detail that can cause your application to compile with very different functionality than you're expecting if they aren't matched up.
At this point, the local pointers can be treated like normal pointers for your accelerated application’s functionality. So before I launch the kernel, I’m going to load the phase increment data I want written out to the DDS compiler’s input from the accelerated kernel:
uint32_t phase_1MHz = 0x0051EB85; // phase increment word for a 1 MHz output

for (int i = 0; i < DATA_SIZE; i++) {
    ptr_phase[i] = phase_1MHz;
}
Based on the DDS compiler's settings in the Vivado block design, I'm loading the phase pointer so that it outputs a 1 MHz cosine wave for a period of 4096 samples.
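For anyone wondering where a value like that comes from, the phase increment word for a DDS is just the desired output frequency scaled by the size of the phase accumulator: increment = round((f_out / f_clk) * 2^B), where B is the accumulator width. Here's a minimal sketch of that calculation; the 100 MHz clock and 32-bit accumulator in the example are placeholder values for illustration, not necessarily what my block design is configured for.

#include <cstdint>
#include <cmath>

// Compute the phase increment word for a DDS with a phase_width-bit accumulator:
// increment = round((f_out / f_clk) * 2^phase_width)
uint32_t phase_increment(double f_out_hz, double f_clk_hz, unsigned phase_width)
{
    double inc = (f_out_hz / f_clk_hz) * std::pow(2.0, (double)phase_width);
    return (uint32_t)std::llround(inc);
}

// Example (placeholder numbers): phase_increment(1.0e6, 100.0e6, 32) returns
// 0x028F5C29 for a 1 MHz output from a DDS clocked at 100 MHz with a 32-bit accumulator.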
With the phase increment data loaded, I let the default code in the template migrate the data to the kernel, launch it, then transfer the received data back from the kernel. This ultimately results in the local pointer, ptr_waveout, now pointing to the waveform data output from the DDS compiler. To verify the data is in fact representative of a 1 MHz cosine wave, I simply wrote the contents of the pointer to a text file to store locally in the embedded Linux image:
// Dump the first 1024 waveform samples to a text file for offline plotting
FILE *fp_wave = fopen("wave_out.txt", "w");
for (int i = 0; i < 1024; i++) {
    fprintf(fp_wave, "%i\n", ptr_waveout[i]);
}
fclose(fp_wave);
Note that the format specifier used here when writing to the text file is for unsigned data even though the samples are signed. I honestly can't tell you what happened here, but something about using the signed data format specifier was converting the data so that the cosine waveform didn't plot correctly.
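For reference, the data migration and kernel launch that I'm letting the template code handle look roughly like this. It's a sketch based on the standard Vadd host code, so the exact calls, buffer lists, and event handling in your copy of the template may differ.

// Send the input buffers (including the phase increment data) to the device
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_phase}, 0 /* 0 = host to device */));

// Launch the kernel
OCL_CHECK(err, err = q.enqueueTask(krnl_vector_add));

// Bring the output buffers (including the DDS waveform) back to the host
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_waveout}, CL_MIGRATE_MEM_OBJECT_HOST));

// Wait for everything to finish before touching the mapped pointers
q.finish();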
Then finally, to clean up after the accelerated application has been run, unmap the two buffers from the DDR:
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_phase, ptr_phase));
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_waveout, ptr_waveout));
For reference, here’s a diagram of the data flow between the PL, kernel, and accelerated application:
I then transferred the text file to my host PC (I ran my application on the hardware emulator in Vitis as outlined in my previous post, but you can use this same process running on actual hardware):
root@linux_os: ~# scp /mnt/sd-mmcblk0p1/wave_out.txt <host user>@xxx.xxx.xx.x:/<desired host location>
From there, I threw together a quick little Python script (attached below) to plot and display the data to verify that it was indeed a cosine wave. You could also copy and paste the data into an Excel file and plot it that way.
And we have a cosine!
So remember when I warned about an angry crocodile pit for you software engineers? Well, I also found one for us digital hardware folks…
I initially had the data width of the AXI stream packet in the kernel set to match the Vivado configuration at what I THOUGHT was 16 bits wide:
typedef ap_axis<15, 0, 0, 0> data_pkt;
However, when I went to plot my cosine wave I noticed that the data values were overflowing at +/- 16384:
The DDS compiler is configured in Vivado to output in 2's complement, so between the sign bit and whatever black magic was happening in the Vitis compilation of the HLS kernel, I was losing a bit's worth of data off the MSB. With 16 bits of 2's complement data I should be able to represent values from -32768 to +32767, but the data was wrapping around at +/- 16384, which is what you get with only 15 bits.
Well, after some digging through the Vitis user guide and Xilinx forums, I discovered that the data width in the AXI stream packet declaration IS NOT ZERO INDEXED. So when I thought I was creating a 16-bit wide packet with typedef ap_axis<15, 0, 0, 0>, I was actually only creating a 15-bit wide packet. And typedef ap_axis<16, 0, 0, 0> does not create a 17-bit wide AXI stream like I thought, but instead the actual 16 bits I needed.
typedef ap_axis<16, 0, 0, 0> data_pkt;
So fixing my AXI stream packet to actually be 16 bits wide gave me the pretty cosine waveform I had originally been expecting:
This discovery had about the same impact on me as feeding a Mogwai after midnight, and had me at full Gremlin-level anger. But hey, I definitely won't forget that the ap_axis declarations aren't zero-indexed in my next project!