Researchers Deliver Dramatic Performance, Efficiency Gains for LLMs with the FPGA-Driven TerEffic
Memory-saving modifications deliver performance two orders of magnitude higher than GPU-based edge AI accelerators.
Researchers from Peking University and the National University of Singapore have come up with a way to run large language models (LLMs) on relatively compact FPGA hardware: TerEffic, which they claim delivers 149 times the throughput of NVIDIA's Jetson Orin Nano at 19 times the power efficiency.
"Large language model (LLM) deployment on edge devices is typically constrained by the need for off-chip memory access, leading to high power consumption and limited throughput," the researchers explain of their work. "Ternary quantization for LLMs is promising in maintaining model accuracy while reducing memory footprint. However, existing accelerators have not exploited this potential for on-chip inference. We present TerEffic, an FPGA-based accelerator that carefully co-designs memory architecture and computational units to unlock highly efficient LLM inference with fully on-chip execution."
Typically, large language models (the technology behind chatbots like ChatGPT, Claude, DeepSeek, and others) require a hefty amount of memory for inference, meaning edge devices can only run smaller models. Quantization trades a little accuracy for a much smaller memory footprint, and TerEffic exploits an aggressive form of it, ternary quantization, to allow LLMs to run either entirely within the on-chip memory of an FPGA or with the help of off-chip high-bandwidth memory (HBM).
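To make the idea concrete, the sketch below shows an absmean-style ternary quantizer that maps each weight to one of {-1, 0, +1} plus a single scaling factor. It is an illustrative example of the general technique, not TerEffic's exact scheme; the scaling and rounding choices here are assumptions for demonstration only.

```python
import numpy as np

def ternarize(weights: np.ndarray):
    """Quantize a float weight matrix to {-1, 0, +1} plus one per-tensor scale.

    Absmean-style ternarization (illustrative only, not TerEffic's method).
    """
    scale = np.abs(weights).mean()                         # per-tensor scaling factor
    q = np.clip(np.round(weights / scale), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Approximate reconstruction used at inference time: W ~= scale * Q."""
    return scale * q.astype(np.float32)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)   # FP32 weights: 32 bits each
    q, s = ternarize(w)
    # A ternary weight carries log2(3) ~= 1.58 bits of information (~2 bits stored),
    # versus 32 bits for FP32 -- roughly a 16x reduction in weight memory.
    err = np.abs(w - dequantize(q, s)).mean()
    print(f"scale={s:.4f}, mean abs reconstruction error={err:.4f}")
```

It is that reduction in weight storage that makes fully on-chip execution plausible, since the compressed model can fit in the FPGA's own memory rather than being streamed from off-chip DRAM.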
"Through weight compression, custom computational units, and memory hierarchy optimization, we achieve unprecedented efficiency by eliminating off-chip memory bandwidth bottlenecks," The team claims. "We propose two architectural variants: a fully on-chip design for smaller models and an HBM-assisted design for larger ones."
In testing, both approaches showed promise. With the fully on-chip design, the team implemented a 370-million-parameter model and achieved 12,700 tokens per second, a claimed 149 times the throughput of the same model running on the GPU of NVIDIA's Jetson Orin Nano system-on-module, along with a power efficiency of 467 tokens per second per Watt, 19 times that of the Jetson Orin Nano.
The second test added high-bandwidth memory to the FPGA hardware in order to run a larger model with 2.7 billion parameters. This, the researchers claim, delivered 521 tokens per second, twice the throughput of NVIDIA's high-end A100 accelerator, while drawing just 33W, roughly an eighth of the A100's power, for an efficiency of 16 tokens per second per Watt.
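The quoted throughput, power, and efficiency figures are mutually consistent, and the short sketch below does the back-of-envelope arithmetic; note that the roughly 27W draw it implies for the on-chip design is derived from the article's numbers rather than stated directly.

```python
# Cross-check of the reported figures (all inputs are numbers quoted above).
on_chip_tps = 12_700      # tokens/s, 370M-parameter model, fully on-chip design
on_chip_eff = 467         # tokens/s per Watt for that design
hbm_tps = 521             # tokens/s, 2.7B-parameter model, HBM-assisted design
hbm_power_w = 33          # Watts drawn by the HBM-assisted design

# Implied power draw of the on-chip design (not stated in the article): ~27 W.
print(f"implied on-chip power: {on_chip_tps / on_chip_eff:.1f} W")

# Efficiency of the HBM-assisted design: ~16 tokens/s per Watt, matching the article.
print(f"HBM-assisted efficiency: {hbm_tps / hbm_power_w:.1f} tokens/s/W")
```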
"Our work establishes a foundation for future research in hardware-efficient LLM deployment," the researchers conclude, "particularly in resource-constrained environments where power efficiency is paramount."
The team's work is available as an open-access preprint on Cornell's arXiv server.