Continuing my series on implementing a simple FIR filter in custom RTL, my next step is to pair my custom FIR RTL with Xilinx's DDS Compiler IP block, which exercises the FIR the way a practical application in a larger project might.
To fully validate my custom FIR Verilog module, I decided to use the DDS Compiler IP block to output a chirp signal that starts in the passband of my FIR (which currently contains coefficients for a low pass filter with a sample rate of 100 MSps) and increases in frequency until it ends in the FIR's stopband.
I'm starting with a new project in Vivado:
This project will be targeting my Arty Z7 board with the larger fabric Zynq 7020 chip (the Arty Z7-20).
First things first, create a new block design and add the Zynq Processing System IP core:
A prompt to run block automation will appear. Run it to apply the Arty Z7-20 board presets to the Zynq Processing System IP:
After adding the Zynq IP, add the DDS Compiler IP from the IP library. Configure it for a phase width of 32 bits and an output width of 16 bits (to match the expected input width of the custom FIR Verilog module), set the phase increment programmability to streaming, and enable the AXI Stream options for packet streaming and an output tlast signal.
To add the custom FIR Verilog module, use the Add Sources option in the Flow Navigator window, selecting the Add or create design sources option:
I simply selected the Add Files option then pointed to the location of the FIR Verilog file from my last project. If you want to create your own FIR Verilog module, select the Create File option.
Once the Verilog file has been added to the project, add it to the block design by right-clicking in free space on the block design and selecting the Add Module... option. Vivado will automatically detect all valid Verilog modules available in the project for you to select from. At this point, only the FIR Verilog module will appear:
A second Verilog module is needed to control the DDS Compiler and stream the phase increment values to it with the right timing such that the DDS Compiler produces the desired chirp signal for testing the FIR.
I could use a DMA IP block to stream the phase increment values from the ARM-core processor of the Zynq. However, since this is just for testing and I want the same set of phase increment values sent to the DDS in an endless loop, controlling the DDS Compiler from the RTL fabric with a simple Verilog module is much easier and faster for me to implement.
Again, using the Add Sources option from the Flow Navigator window, create a new design source:
The phase increment state machine I developed simply starts by outputting the phase increment value for a 1 MHz signal, then waits the equivalent of one period of a 1 MHz signal (1 us) before increasing the phase increment value by the equivalent of 1 MHz.
Thus the phase increment output of the state machine sweeps from 1 MHz to 25 MHz in 1 MHz steps at 1 us intervals. My FIR filter coefficients implement a low pass filter (LPF) with a passband from 0 Hz - 10 MHz and a stopband starting at 20 MHz. A chirp signal from 1 MHz - 25 MHz should therefore come out of the FIR unattenuated from 1 MHz - 10 MHz, then roll off progressively through the transition band of the FIR until it hits 20 MHz, at which point my FIR should be attenuating the output to nearly zero.
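For reference, a DDS phase increment word follows the usual relation increment = f_out / f_clk * 2^B, where B is the phase accumulator width. The simulation-only sketch below applies that relation. The names phase_inc_calc, F_CLK_HZ and PHASE_WIDTH are hypothetical, and the 29-bit width is an assumption inferred from the constants used in the state machine in this post: with B = 29 the sketch should reproduce 32'h51EB85 for 1 MHz and 32'h8000000 for 25 MHz. Adjust it to whatever phase width your DDS Compiler configuration actually reports.

```verilog
// Simulation-only sketch: derive DDS phase increment words from a target
// output frequency. PHASE_WIDTH = 29 is an ASSUMPTION inferred from the
// constants in this post; it is not a value stated by the DDS Compiler GUI.
module phase_inc_calc;
    localparam real    F_CLK_HZ    = 100.0e6; // DDS sample clock (100 MHz)
    localparam integer PHASE_WIDTH = 29;      // assumed accumulator width

    // increment = f_out / f_clk * 2^PHASE_WIDTH, truncated to an integer
    function [31:0] phase_inc;
        input real f_out_hz;
        begin
            phase_inc = $rtoi(f_out_hz / F_CLK_HZ * (1 << PHASE_WIDTH));
        end
    endfunction

    initial begin
        $display("1 MHz  -> 32'h%h", phase_inc(1.0e6));
        $display("25 MHz -> 32'h%h", phase_inc(25.0e6));
        $finish;
    end
endmodule
```

Running this in any Verilog simulator is a quick way to double-check hand-computed constants before hard-coding them into the state machine.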
The AXI Stream interface signals are very simple in this case. Monitoring and controlling the tlast, tready and tvalid signals makes up the majority of the state machine's functionality, since the phase increment itself is basically just a big counter:
module phase_inc_sm(
input clk,
input reset,
output reg m_axis_phase_tvalid,
output reg m_axis_phase_tlast,
input m_axis_phase_tready,
output reg [31:0] m_axis_phase_tdata
);
reg [31:0] carrier_freq;
reg [31:0] carrier_period;
// Phase increment and dwell time for 1 MHz
wire [31:0] carrier_freq_1m;
wire [31:0] carrier_period_1m;
assign carrier_freq_1m = 32'h51EB85;
assign carrier_period_1m = 32'd100; // 100 clocks at 100 MHz = 1 us
// Phase increment for 25 MHz (upper end of the sweep)
wire [31:0] carrier_freq_25m;
assign carrier_freq_25m = 32'h8000000;
reg [2:0] state_reg;
reg [31:0] period_wait_cnt;
parameter init = 3'd0;
parameter SetCarrierFreq = 3'd1;
parameter SetTvalidHigh = 3'd2;
parameter SetSlavePhaseValue = 3'd3;
parameter CheckTready = 3'd4;
parameter WaitState = 3'd5;
parameter SetTlastHigh = 3'd6;
parameter SetTlastLow = 3'd7;
// Synchronous, active-low reset (same convention as the FIR module and test bench)
always @ (posedge clk)
begin
if (reset == 1'b0)
begin
m_axis_phase_tdata[31:0] <= 32'd0;
state_reg <= init;
end
else
begin
case(state_reg)
init : //0
begin
period_wait_cnt <= 32'd0;
m_axis_phase_tlast <= 1'b0;
m_axis_phase_tvalid <= 1'b0;
carrier_freq <= carrier_freq_1m;
state_reg <= SetCarrierFreq;// WaitForStart;
end
SetCarrierFreq : //1
begin
if (carrier_freq > carrier_freq_25m)
begin
carrier_freq <= carrier_freq_1m;
end
else
begin
carrier_freq <= carrier_freq + carrier_freq_1m;
end
carrier_period <= carrier_period_1m;
state_reg <= SetTvalidHigh;
end
SetTvalidHigh : //2
begin
m_axis_phase_tvalid <= 1'b1;
state_reg <= SetSlavePhaseValue;
end
SetSlavePhaseValue : //3
begin
m_axis_phase_tdata[31:0] <= carrier_freq;
state_reg <= CheckTready;
end
CheckTready : //4
begin
if (m_axis_phase_tready == 1'b1)
begin
state_reg <= WaitState;
end
else
begin
state_reg <= CheckTready;
end
end
WaitState : //5
begin
if (period_wait_cnt >= carrier_period)
begin
period_wait_cnt <= 32'd0;
state_reg <= SetTlastHigh;
end
else
begin
period_wait_cnt <= period_wait_cnt + 1;
state_reg <= WaitState;
end
end
SetTlastHigh : //6
begin
m_axis_phase_tlast <= 1'b1;
state_reg <= SetTlastLow;
end
SetTlastLow : //7
begin
m_axis_phase_tlast <= 1'b0;
state_reg <= SetCarrierFreq;
end
endcase
end
end
endmodule
Save the phase increment state machine Verilog code then return to the block design where the Add Module option will now show it as a valid module to add to the block design:
Finally, the FIR Verilog module needs somewhere with a valid AXI Stream interface for its output samples to go, otherwise the whole design won't run. This is due to the nature of the AXI Stream interface: each module uses the tvalid and tready signals to control data flow, so if the handshake stalls at any point in the chain, every module upstream of that point stops taking in or outputting data.
As a solution, I wrote another very simple state machine in Verilog that outputs a constant tready signal to an upstream device (the FIR module in this case) and simply takes each sample from the tdata line and writes it to the same register when the tvalid and tkeep signals from the FIR are asserted.
module mem_dump_sm(
input clk,
input reset,
input signed [31:0] s_axis_mem_tdata,
input [3:0] s_axis_mem_tkeep,
input s_axis_mem_tlast,
input s_axis_mem_tvalid,
output s_axis_mem_tready
);
assign s_axis_mem_tready = 1'b1;
reg signed [31:0] mem_location;
always @ (posedge clk)
begin
if (s_axis_mem_tkeep == 4'hf && s_axis_mem_tvalid == 1'b1)
begin
mem_location <= s_axis_mem_tdata;
end
else
begin
mem_location <= mem_location;
end
end
endmodule
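The handshake rule that makes this sink necessary is that an AXI Stream transfer only happens on a clock edge where tvalid and tready are both high. As a minimal sketch of that rule (axis_beat_counter and its port names are hypothetical and not part of this project), the module below counts completed transfers on whatever stream it is attached to:

```verilog
// Sketch: count AXI Stream transfers. A beat is transferred only on a
// rising clock edge where tvalid and tready are both asserted.
// Module and port names are illustrative, not from the original design.
module axis_beat_counter (
    input  wire        clk,
    input  wire        resetn,       // synchronous, active-low reset
    input  wire        axis_tvalid,  // driven by the upstream (master) side
    input  wire        axis_tready,  // driven by the downstream (slave) side
    output reg  [31:0] beat_count
);
    wire transfer = axis_tvalid && axis_tready;

    always @ (posedge clk)
    begin
        if (resetn == 1'b0)
            beat_count <= 32'd0;
        else if (transfer)
            beat_count <= beat_count + 32'd1;
    end
endmodule
```

If the downstream side never raises tready, the count never advances, which is the stall described above.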
Add the memory dump state machine to the block design the same way as before with the Add Module... option:
Connect the memory dump AXI Stream interface to the output of the FIR module and verify the block design.
Mark each of the AXI Stream interface connection lines for debug (right-click and select Debug...). Run the connection automation option that pops up after marking each of the interface lines for debug, ensuring the AXI Stream protocol checker is enabled on all of them.
After running this connection automation, ILAs (integrated logic analyzers) will appear in the block design, prompting me to regenerate the layout and rerun validation on the block design:
Save the block design and create an HDL wrapper for it to instantiate it (selecting the option to let Vivado auto-manage it):
It's worth noting that once the HDL wrapper is created, the Sources hierarchy changes to pull the FIR, phase increment state machine, and memory dump state machine Verilog files under the block design. This demonstrates the role of the HDL wrapper in bridging the last gap between the block design and other source files in a Vivado project.
At this point the design is complete and ready for synthesis, implementation (place and route), and finally bitstream generation for the Arty board. I like to select the Generate Bitstream option from the tool bar along the top of the Vivado window, as it will automatically kick off synthesis and implementation runs when needed. Since I've run each of these pieces in behavioral simulation separately in the past, I'm taking a risk and jumping straight to running this design on actual hardware (spoiler: this wasn't the best idea I've ever had...).
To prep the Arty board for this particular use case, the jumpers must be configured so that it boots from its QSPI flash memory (where the bitstream is being flashed to) and draws its power from the USB host instead of from the regulator connected to the barrel jack for power from an AC outlet (since we're not booting from the SD card and trying to run a Linux image, USB power is enough).
After connecting the Arty board to the host PC, launch Hardware Manager and select the Auto Connect option.
Once opened, nothing will appear in the ILA window in the Hardware Manager since the bitstream hasn't been uploaded to the FPGA yet. The option to program the device will appear at the top of the window.
Vivado will auto-fill the generated bitstream files from the project in the pop-up window for programming the FPGA. Once programming completes, the Hardware Manager window will refresh to show the ILA windows that were added by marking the AXI Stream interfaces for debug in the block design.
Since the design was set up in such a way that it will forever run in the loop of generating and filtering the chirp signal, run the immediate trigger option at any time in the ILA to see the result.
It was at this point that I realized I had gotten too confident by skipping the behavioral simulation stage and going straight to synthesis, implementation, bitstream generation, and deployment on hardware. The ILAs revealed the awful truth that my FIR wasn't filtering anything: the entirety of the chirp signal from 1 MHz - 25 MHz was passing through it completely unattenuated.
That wasn't even my only problem. As I zoomed into the output of my DDS Compiler at the higher end of the chirp (15 MHz and above), I found that the sine wave didn't look like a smooth, full wave. It was just an irregular zig-zag pattern.
After some digging through many forums and the product guide for the Xilinx DDS Compiler (PG141), I found that using the Hardware Parameters option was a risky thing to do, since it takes your specified bit widths and calculates its own spurious-free dynamic range (SFDR) and frequency resolution for the output sine wave.
SFDR is an important parameter for a DDS Compiler: it describes the ratio between the power level of the fundamental frequency and that of the strongest spurious signal around it, and it tells the DDS how big that ratio needs to be. If the SFDR is too low, too many spurious signals come through with the desired frequency, giving you a messy output like what I was getting (and the higher the frequency of the desired output signal, the more susceptible it is to this). On the other hand, if the SFDR is set too high it will filter everything out, including the desired fundamental frequency, since the delta between it and any spurious signal(s) isn't large enough.
Another fun fact: if you're forcing certain hardware parameters, the DDS can also sometimes compensate by settling on a not very useful frequency resolution, such as 1500 Hz.
I found (through much trial and error using behavioral simulation) that the System Parameters option with a 90 dB SFDR and a 0.2 Hz frequency resolution generated the cleanest possible signal up to 25 MHz while still maintaining my 32-bit input width and 16-bit output width on the tdata lines.
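As a rough worked example of how a requested frequency resolution maps to a phase accumulator width, here is the standard DDS relation, where B is the phase width in bits and f_clk is the 100 MHz sample clock. This assumes the core rounds the width up to the next whole bit, so treat the numbers as a sanity check rather than a statement of what the IP reports:

```latex
\Delta f = \frac{f_{clk}}{2^{B}}
\quad\Longrightarrow\quad
B = \left\lceil \log_2 \frac{f_{clk}}{\Delta f} \right\rceil
  = \left\lceil \log_2 \frac{100 \times 10^{6}}{0.2} \right\rceil
  = \left\lceil 28.9 \right\rceil = 29,
\qquad
\Delta f_{actual} = \frac{100 \times 10^{6}}{2^{29}} \approx 0.186\ \text{Hz}
```

Under that assumption, a 0.2 Hz request fits in 29 bits of phase, which still sits comfortably inside the 32-bit tdata word on the phase channel.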
To verify that I had the DDS Compiler settings right, I added a FIR compiler and gave it all of the same settings to mimic the desired behavior of my custom FIR Verilog such as the same coefficient values, 100 MHz sample rate, 100 MHz FPGA system clock frequency, and AXI Stream interface:
Since I learned my lesson that I still need to prove out each new design with a behavioral simulation, I right-clicked on each AXI Stream interface line in the block design and chose Mark Simulation instead, so that each one is auto-populated in the wave window of the behavioral simulation when I run it.
Overall, I ended up with the following block design, where the phase increment state machine, DDS Compiler settings, and memory dump state machines are the same for both data paths. The only difference is that one path implements my custom FIR Verilog module while the other has the Xilinx FIR Compiler IP:
Adding a new simulation source to the design this time (Add Sources --> Add or create simulation sources), I added a simple test bench just to provide the reset and system clock needed to get the design to run in simulation (don't forget to set this test bench file as the top level file under simulation sources in the Hierarchy tab - see my previous tutorial on test benching/behavioral simulation if you're not familiar with this step).
Test bench code:
module fir_tb;
reg clk, reset;
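/*
* Generate a 100 MHz (10 ns period) clock, assuming a 1 ns simulation time unit
*/
always begin
clk = 1; #5;
clk = 0; #5;
end
/*
* Hold the active-low reset for 40 ns, then release it for the rest of the run
*/
initial begin
reset = 0; #40;
reset = 1;
end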
fir_design fir_design_i(
.clk(clk),
.reset(reset));
endmodule
I also re-generated my filter coefficients (I will cover generating filter coefficients in a future installment in this series) to check my math on the fixed-point conversion and to widen the transition band of the filter a bit to improve my filter's frequency response and get greater attenuation in the stopband. I previously had the transition band from 10 MHz - 16 MHz and I increased it to 10 MHz - 20 MHz.
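One cheap way to check a fixed-point conversion is to sum the quantized taps and to confirm the accumulator has enough headroom. The simulation-only sketch below does that for the 15 taps used in the revised FIR listing in this post. The module name tap_headroom_check is hypothetical, and the worst-case bound assumes 16-bit signed input samples (magnitude up to 32768) feeding a 32-bit signed accumulator.

```verilog
// Simulation-only sketch: sanity-check the quantized FIR taps.
// Assumes 16-bit signed input samples and a 32-bit signed accumulator.
module tap_headroom_check;
    reg signed [15:0] taps [0:14];
    integer i;
    integer tap_sum;    // plain sum of taps (DC gain before any scaling)
    integer abs_sum;    // sum of |tap| (worst-case gain)
    real    worst_case; // largest possible accumulator magnitude

    initial begin
        taps[0]  = 16'shfe64; taps[1]  = 16'shfc8a; taps[2]  = 16'shfc04;
        taps[3]  = 16'shff93; taps[4]  = 16'sh0883; taps[5]  = 16'sh14ef;
        taps[6]  = 16'sh1ff7; taps[7]  = 16'sh2463; taps[8]  = 16'sh1ff7;
        taps[9]  = 16'sh14ef; taps[10] = 16'sh0883; taps[11] = 16'shff93;
        taps[12] = 16'shfc04; taps[13] = 16'shfc8a; taps[14] = 16'shfe64;

        tap_sum = 0;
        abs_sum = 0;
        for (i = 0; i < 15; i = i + 1) begin
            tap_sum = tap_sum + taps[i];
            if (taps[i] < 0)
                abs_sum = abs_sum - taps[i];
            else
                abs_sum = abs_sum + taps[i];
        end

        // Largest |sample| for a 16-bit signed input is 32768
        worst_case = abs_sum * 32768.0;
        $display("sum of taps   = %0d", tap_sum);
        $display("sum of |taps| = %0d", abs_sum);
        if (worst_case < 2147483648.0)
            $display("worst case %0f fits in a 32-bit signed accumulator", worst_case);
        else
            $display("worst case %0f can overflow a 32-bit signed accumulator", worst_case);
        $finish;
    end
endmodule
```

The same check can be rerun whenever the coefficients are regenerated, since only the tap assignments need to change.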
Despite all of this though, I still found in the simulation that my FIR was not filtering anything I put through it, while the FIR compiler IP filtered the same DDS output chirp signal perfectly as expected with the exact same parameters as my FIR.
I spent a lot of time playing with the rate at which the tvalid signal was clocking data into my FIR module since the most common cause for a FIR not to filter is that it's processing the same sample more than once at some point in the logic.
This ultimately led me to discover that the way I had coded the accumulator stage of my FIR to work in parallel with the circular buffer and multiplier stages was doing exactly this: accumulating the same samples more than once.
To fix this, I pulled the accumulator out into its own always @ block, but left the circular buffer and multiplier stages running in parallel:
module FIR(
input clk,
input reset,
input signed [15:0] s_axis_fir_tdata,
input [3:0] s_axis_fir_tkeep,
input s_axis_fir_tlast,
input s_axis_fir_tvalid,
input m_axis_fir_tready,
output reg m_axis_fir_tvalid,
output reg s_axis_fir_tready,
output reg m_axis_fir_tlast,
output reg [3:0] m_axis_fir_tkeep,
output reg signed [31:0] m_axis_fir_tdata
);
/* This block drives the tkeep signal on the master AXI Stream interface */
always @ (posedge clk)
begin
m_axis_fir_tkeep <= 4'hf;
end
/* This block passes the tlast signal from the slave to the master AXI Stream interface */
always @ (posedge clk)
begin
if (s_axis_fir_tlast == 1'b1)
begin
m_axis_fir_tlast <= 1'b1;
end
else
begin
m_axis_fir_tlast <= 1'b0;
end
end
// 15-tap FIR
reg enable_fir;
reg signed [15:0] buff0, buff1, buff2, buff3, buff4, buff5, buff6, buff7, buff8, buff9, buff10, buff11, buff12, buff13, buff14;
wire signed [15:0] tap0, tap1, tap2, tap3, tap4, tap5, tap6, tap7, tap8, tap9, tap10, tap11, tap12, tap13, tap14;
reg signed [31:0] acc0, acc1, acc2, acc3, acc4, acc5, acc6, acc7, acc8, acc9, acc10, acc11, acc12, acc13, acc14;
/* Taps for LPF running @ 100MSps with passband from 0 Hz - 10 MHz */
/* and a stopband from 20 MHz - 50 MHz */
assign tap0 = 16'hfe64;
assign tap1 = 16'hfc8a;
assign tap2 = 16'hfc04;
assign tap3 = 16'hff93;
assign tap4 = 16'h0883;
assign tap5 = 16'h14ef;
assign tap6 = 16'h1ff7;
assign tap7 = 16'h2463;
assign tap8 = 16'h1ff7;
assign tap9 = 16'h14ef;
assign tap10 = 16'h0883;
assign tap11 = 16'hff93;
assign tap12 = 16'hfc04;
assign tap13 = 16'hfc8a;
assign tap14 = 16'hfe64;
/* This block controls the tready & tvalid signals on the master & slave AXI Stream interfaces */
always @ (posedge clk)
begin
if(reset == 1'b0 || m_axis_fir_tready == 1'b0 || s_axis_fir_tvalid == 1'b0)
begin
enable_fir <= 1'b0;
s_axis_fir_tready <= 1'b0;
m_axis_fir_tvalid <= 1'b0;
end
else
begin
enable_fir <= 1'b1;
s_axis_fir_tready <= 1'b1;
m_axis_fir_tvalid <= 1'b1;
end
end
/* Circular buffer w/ multiply stages of FIR */
always @ (posedge clk)
begin
if(enable_fir == 1'b1)
begin
buff0 <= s_axis_fir_tdata;
acc0 <= tap0 * buff0;
buff1 <= buff0;
acc1 <= tap1 * buff1;
buff2 <= buff1;
acc2 <= tap2 * buff2;
buff3 <= buff2;
acc3 <= tap3 * buff3;
buff4 <= buff3;
acc4 <= tap4 * buff4;
buff5 <= buff4;
acc5 <= tap5 * buff5;
buff6 <= buff5;
acc6 <= tap6 * buff6;
buff7 <= buff6;
acc7 <= tap7 * buff7;
buff8 <= buff7;
acc8 <= tap8 * buff8;
buff9 <= buff8;
acc9 <= tap9 * buff9;
buff10 <= buff9;
acc10 <= tap10 * buff10;
buff11 <= buff10;
acc11 <= tap11 * buff11;
buff12 <= buff11;
acc12 <= tap12 * buff12;
buff13 <= buff12;
acc13 <= tap13 * buff13;
buff14 <= buff13;
acc14 <= tap14 * buff14;
end
end
/* Accumulate stage of FIR */
always @ (posedge clk)
begin
if (enable_fir == 1'b1)
begin
m_axis_fir_tdata <= acc0 + acc1 + acc2 + acc3 + acc4 + acc5 + acc6 + acc7 + acc8 + acc9 + acc10 + acc11 + acc12 + acc13 + acc14;
end
end
endmodule
I reran my behavioral simulation with the new DDS settings, new coefficients, and new FIR Verilog code, and to my delight my FIR started filtering perfectly. Comparing its output to the output of the FIR Compiler IP showed no difference between the two, making me feel confident that this logic revision of my FIR Verilog was the one!
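Another quick check that is independent of the DDS is an impulse response test: feed the filter a single non-zero sample surrounded by zeros, and the output should step through the tap values in order. The sketch below is a hypothetical test bench (fir_impulse_tb is not part of the original project) that assumes the FIR module listed above is compiled alongside it. Because the delay line in that module has no reset, the bench first flushes it with zeros.

```verilog
`timescale 1ns / 1ps
// Hypothetical impulse-response test bench for the FIR module in this post.
module fir_impulse_tb;
    reg clk = 1'b0;
    reg reset = 1'b0;                 // active-low, as in the FIR listing
    reg signed [15:0] sample = 16'sd0;
    wire signed [31:0] fir_out;
    wire out_valid;
    integer n;

    always #5 clk = ~clk;             // 100 MHz clock

    FIR dut (
        .clk(clk),
        .reset(reset),
        .s_axis_fir_tdata(sample),
        .s_axis_fir_tkeep(4'hf),
        .s_axis_fir_tlast(1'b0),
        .s_axis_fir_tvalid(1'b1),     // input always valid
        .m_axis_fir_tready(1'b1),     // sink always ready
        .m_axis_fir_tvalid(out_valid),
        .s_axis_fir_tready(),
        .m_axis_fir_tlast(),
        .m_axis_fir_tkeep(),
        .m_axis_fir_tdata(fir_out)
    );

    initial begin
        repeat (4) @(posedge clk);
        reset = 1'b1;
        // Flush the unreset delay line and product registers with zeros
        repeat (24) @(posedge clk);
        // Apply a one-sample impulse of amplitude 1
        @(negedge clk) sample = 16'sd1;
        @(negedge clk) sample = 16'sd0;
        // Print the response; it should walk through tap0 .. tap14
        for (n = 0; n < 20; n = n + 1) begin
            @(negedge clk);
            $display("t=%0t valid=%b fir_out=%0d", $time, out_valid, fir_out);
        end
        $finish;
    end
endmodule
```

If any tap value shows up twice or a sum of several taps appears in a single output sample, that points at the kind of double accumulation described above.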
Finally, I reran synthesis and implementation and generated a new bitstream to verify there were no other hidden errors in my FIR Verilog, such as timing violations.
And that's it! My FIR Verilog is ready for new applications and use cases for future projects.