The Fast Fourier Transform (FFT) is a critical algorithm in digital signal processing, widely used in applications like radar, communications, and spectral analysis. Implementing FFT in Vitis High-Level Synthesis (HLS) allows for efficient design on FPGAs, leveraging the parallel processing capabilities of programmable logic while maintaining flexibility through software-level programming.
Key features of FFT implementation in Vitis HLS include:
- Pipelining: Ensures high throughput by processing multiple data points concurrently, enabling continuous data flow without bottlenecks.
- Fixed-point arithmetic: Reduces hardware resource usage while balancing precision, a common requirement for resource-constrained systems.
- Customization: Offers flexibility in designing the number of FFT points, data widths, and optimization levels for performance or resource trade-offs.
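To make the fixed-point trade-off concrete, here is a minimal Python sketch of quantizing a value to a signed 16-bit fraction. The Q1.15 format (1 sign bit, 15 fractional bits), the saturation behavior, and the helper names `to_q15` and `from_q15` are assumptions for illustration only; the actual core may use a different format, and in Vitis HLS this role is normally played by the `ap_fixed` types.

```python
FRAC_BITS = 15                      # assumed Q1.15: 1 sign bit + 15 fractional bits
SCALE = 1 << FRAC_BITS              # 32768 codes per unit
Q_MIN, Q_MAX = -SCALE, SCALE - 1    # integer range of a signed 16-bit word


def to_q15(x: float) -> int:
    """Round to the nearest Q1.15 code (ties to even) and saturate on overflow."""
    code = round(x * SCALE)
    return max(Q_MIN, min(Q_MAX, code))


def from_q15(code: int) -> float:
    """Convert a Q1.15 code back to a real value."""
    return code / SCALE


for value in (0.5, -0.25, 0.1, 1.0):
    code = to_q15(value)
    print(f"{value:+.4f} -> {code:6d} -> {from_q15(code):+.6f}")
```

Values that fall on the grid (0.5, -0.25) survive the round trip exactly, 0.1 picks up a rounding error of at most half an LSB, and 1.0 saturates to the largest positive code because +1.0 is not representable in this assumed format.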
The benefits of using Vitis HLS for FFT design are:
- Hardware acceleration: By offloading FFT computations to the FPGA's programmable logic, the processing system is freed for other tasks.
- Optimized hardware: Vitis HLS allows you to tune initiation intervals, latency, and DSP utilization for an efficient implementation.
- Seamless integration: With support for AXI interfaces, the HLS FFT IP core can be easily integrated into larger systems on platforms like Zynq-7000.
Let's explore the design steps, hardware architecture, and performance evaluation of a fully pipelined, fixed-point FFT IP core using Vitis HLS.
The hardware architecture of FFT implementation with Vitis HLS
The FFT IP core is composed of the following:
- Read and reversing block: Manages input data rearrangement (bit-reversed addressing).
- Cascades: Stages connected via dual-port BRAMs, each containing a butterfly computation block.
- Write function: Drives the AXI4-Stream master interface for data output.
- Twiddle factors: Stored in ROM memory for efficient coefficient access.
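The blocks listed above map onto the classic radix-2 decimation-in-time structure. The following floating-point Python model is only a sketch of that data flow, not the HLS source: the names `bit_reverse`, `fft_model`, and `twiddle_rom` are hypothetical, and the loop over stages runs sequentially here, whereas in hardware each stage would be a separate pipelined block with a BRAM between neighbors.

```python
import cmath


def bit_reverse(index: int, bits: int) -> int:
    """Reverse the lowest `bits` bits of `index` (0b001 -> 0b100 for bits=3)."""
    result = 0
    for _ in range(bits):
        result = (result << 1) | (index & 1)
        index >>= 1
    return result


def fft_model(samples):
    """Radix-2 decimation-in-time FFT; len(samples) must be a power of two."""
    n = len(samples)
    stages = n.bit_length() - 1
    if n < 2 or 1 << stages != n:
        raise ValueError("length must be a power of two >= 2")

    # Twiddle "ROM": W_n^k = exp(-2*pi*j*k/n) for k in [0, n/2)
    twiddle_rom = [cmath.exp(-2j * cmath.pi * k / n) for k in range(n // 2)]

    # Read and reversing block: store the input in bit-reversed order
    data = [samples[bit_reverse(i, stages)] for i in range(n)]

    # Cascade: one pass per stage, each made of butterflies
    for stage in range(stages):
        half = 1 << stage            # distance between butterfly inputs
        stride = n // (2 * half)     # step through the twiddle ROM
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = twiddle_rom[k * stride]
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = a + b
                data[start + k + half] = a - b
    return data


# Usage: a complex tone at bin 3 of an 8-point transform
tone = [cmath.exp(2j * cmath.pi * 3 * i / 8) for i in range(8)]
spectrum = fft_model(tone)
peak_bin = max(range(8), key=lambda k: abs(spectrum[k]))
print(peak_bin)  # prints 3
```

A model like this is also handy as a golden reference in a C simulation test bench, since its output can be compared sample by sample with the core's output.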
Synthesis and Implementation Report:
- Achieves a fully pipelined design, ensuring continuous data flow with initiation interval (II) = 1 clock cycle.
- Latency is distributed across the subblocks, with each cascade introducing a latency of 3 clock cycles that covers the data read, the processing in the DSP, and the BRAM write.
- Uses 32-bit AXI4-Stream interfaces for input and output; each 32-bit word carries a 16-bit real part and a 16-bit imaginary part.
- Resource utilization: 227 DSPs, 8101 LUTs, and 54 BRAMs.
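The 32-bit stream word format can be sketched in a few lines of Python, for example when preparing test vectors on the host. The bit layout is an assumption: this sketch places the real part in bits 15..0 and the imaginary part in bits 31..16, and the helper names `pack_sample` and `unpack_sample` are hypothetical. Check the core's interface definition before relying on this ordering.

```python
def pack_sample(re: int, im: int) -> int:
    """Pack two signed 16-bit integers into one 32-bit word.

    Assumed layout: real part in bits 15..0, imaginary part in bits 31..16.
    """
    for part in (re, im):
        if not -32768 <= part <= 32767:
            raise ValueError("component does not fit in 16 bits")
    return ((im & 0xFFFF) << 16) | (re & 0xFFFF)


def _sign_extend16(value: int) -> int:
    """Interpret a 16-bit field as a two's-complement signed integer."""
    return value - 0x10000 if value & 0x8000 else value


def unpack_sample(word: int):
    """Split a 32-bit word back into (real, imaginary) signed integers."""
    return _sign_extend16(word & 0xFFFF), _sign_extend16((word >> 16) & 0xFFFF)


word = pack_sample(-2, 3)
print(hex(word))            # prints 0x3fffe
print(unpack_sample(word))  # prints (-2, 3)
```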
Timeline Trace
The timeline trace visualizes the cascaded pipeline and shows how the stages operate in parallel. Stages stall briefly only when waiting for shared BRAM access, so the overall latency matches the synthesis report.
Co-simulation Wave Viewer
The co-simulation waveform captures the timeline from the first input sample to the last output sample, validating the design's pipelined behavior and matching the reported latency metrics.
For performance evaluation, the following hardware design for the Arty Z7-20 development board is used.
The hardware design integrates:
- Programmable Logic (PL): Contains the HLS FFT IP core, HLS DMA, HLS CFG, and the Xilinx xFFT IP core, interconnected with AXI interfaces.
HLS FFT IP:
- A Vitis HLS implementation of a fully pipelined FFT core.
- Designed for high throughput with fixed-point arithmetic to optimize resource usage.
- Processes input data and outputs frequency domain results through AXI interfaces.
HLS DMA:
- A Vitis HLS implementation of a custom DMA (Direct Memory Access) controller.
- Enables efficient data transfer between the Processing System (PS) and the Programmable Logic (PL).
- Handles input/output data streams for both the HLS FFT and Xilinx xFFT cores.
HLS CFG:
- A Vitis HLS implementation of a configuration controller.
- Dynamically configures the Xilinx xFFT v9.1 IP core, enabling run-time adjustments and flexibility in FFT processing.
Xilinx xFFT v9.1 IP:
- A pre-built Xilinx FFT core integrated into the pipeline.
- Complements the custom HLS FFT implementation by providing additional options for FFT computation.
- Processing System (PS): Powered by the Arm Cortex-A9, which handles configuration, control, and runtime interaction with the programmable logic.
- DDR Memory: Shared between the PS and PL for efficient data exchange.
The plot shows the absolute error between the fixed-point FFT implemented in Vitis HLS and the floating-point FFT computed with NumPy. Key insights include:
- The absolute error increases with the FFT length (N) due to the quantization error in the fixed-point arithmetic.
- The error remains small, demonstrating the precision-efficiency trade-off of using fixed-point arithmetic.
- This visualization confirms the effectiveness of the fixed-point FFT implementation for resource-efficient systems while acknowledging limitations in accuracy for higher FFT lengths.
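A comparison of this kind can be prototyped in software before touching hardware. The sketch below is an assumption-laden model, not the measurement behind the plot: it rounds every butterfly output to a 2^-15 grid, scales each stage by 1/2 to keep values in range (the source does not say how the core handles word growth), leaves the twiddle factors unquantized, and uses a pure-Python floating-point FFT in place of NumPy as the reference. The names `fft_scaled`, `round_to_lsb`, and `max_abs_error` are hypothetical.

```python
import cmath

LSB = 2.0 ** -15  # assumed resolution of the 16-bit fixed-point words


def round_to_lsb(z: complex) -> complex:
    """Round real and imaginary parts to the nearest multiple of LSB."""
    return complex(round(z.real / LSB) * LSB, round(z.imag / LSB) * LSB)


def fft_scaled(samples, quantize):
    """Radix-2 FFT with 1/2 scaling per stage; `quantize` models word rounding."""
    n = len(samples)
    bits = n.bit_length() - 1
    if n < 2 or 1 << bits != n:
        raise ValueError("length must be a power of two >= 2")
    # Bit-reversed input order, quantized like data entering the core
    data = [quantize(samples[int(format(i, f"0{bits}b")[::-1], 2)]) for i in range(n)]
    half = 1
    while half < n:
        for start in range(0, n, 2 * half):
            for k in range(half):
                w = cmath.exp(-1j * cmath.pi * k / half)  # W_(2*half)^k
                a = data[start + k]
                b = data[start + k + half] * w
                data[start + k] = quantize((a + b) / 2)
                data[start + k + half] = quantize((a - b) / 2)
        half *= 2
    return data


def max_abs_error(n: int) -> float:
    """Largest per-bin difference between the rounded and the unrounded model."""
    tone = [0.5 * cmath.exp(2j * cmath.pi * 5 * i / n) for i in range(n)]
    exact = fft_scaled(tone, lambda z: z)
    fixed = fft_scaled(tone, round_to_lsb)
    return max(abs(e - f) for e, f in zip(exact, fixed))


for n in (256, 1024, 4096):
    print(n, max_abs_error(n))
```

Because both runs share the same scaling, the reported difference isolates the rounding error. Swapping in a different rounding rule or word width only requires changing `round_to_lsb`.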
The following plot shows the HLS FFT output for different FFT lengths (256, 1024, and 4096 points), with a focus on how the per-bin noise floor drops as the number of points increases.
Red Plot (256-point FFT):
- The noise floor is originally -50 dB in the time domain.
- After applying the 256-point FFT, the total noise power is spread across 256 bins.
- This lowers the per-bin noise floor by 10·log10(256) ≈ 24 dB.
- The effective noise floor in the frequency domain is now -74 dB. This lower noise floor allows the signal peaks (bins 32, 101, and 150) to stand out more clearly.
Blue Plot (1024-point FFT):
- With 1024 points, the noise power is further spread across four times more bins than the 256-point FFT.
- The noise floor reduction becomes 10·log10(1024) ≈ 30 dB.
- The effective noise floor is now approximately -80 dB. Signal peaks remain sharp, with even more separation between the signal and noise.
Green Plot (4096-point FFT):
- The 4096-point FFT spreads the noise across an even larger number of bins.
- The noise floor reduction is now 10·log10(4096) ≈ 36 dB.
- The effective noise floor is reduced to around -86 dB, further enhancing the contrast between signal peaks and the noise floor.
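The three reductions quoted above follow from the same rule: if the noise power is spread evenly over N bins, the per-bin floor drops by 10·log10(N) dB. The short Python check below reproduces the arithmetic, assuming white noise and the -50 dB time-domain floor stated for the plots; the function name `processing_gain_db` is chosen here for illustration.

```python
import math

TIME_DOMAIN_NOISE_DB = -50.0  # noise floor quoted for the time-domain signal


def processing_gain_db(n_points: int) -> float:
    """Per-bin noise reduction when noise power is spread evenly over N bins."""
    return 10.0 * math.log10(n_points)


for n in (256, 1024, 4096):
    gain = processing_gain_db(n)
    floor = TIME_DOMAIN_NOISE_DB - gain
    print(f"N={n:5d}: gain = {gain:5.2f} dB, per-bin floor = {floor:6.2f} dB")
```

Each fourfold increase in FFT length adds about 6 dB of reduction, which is why the floors step from roughly -74 dB to -80 dB to -86 dB.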
More details about FFT implementation with Vitis HLS are available on my website.