Choosing the Best NPU for On-Device AI

Researchers benchmarked a wide range of microcontroller-scale NPUs. Which has the right balance of speed and efficiency for your project?

Nick Bild
Machine Learning & AI
The MAX78000 fared quite well in the benchmarking (📷: Analog Devices)

Reducing costs, enhancing privacy, and minimizing latency are just a few of the reasons why so many organizations are working toward running their artificial intelligence (AI) applications locally, rather than in the cloud. This often means building out on-premises data centers stocked with a healthy supply of GPUs. But for some use cases, even a round trip over a local network for AI processing is unacceptable, whether because of real-time operating constraints or because mobile devices cannot count on a network connection being available at all.

For these types of applications, on-device processing may be the only practical choice. However, mobile devices are often severely constrained in terms of their available computing resources and energy budgets. Under normal circumstances, running AI algorithms on these types of systems is a no-go. But the development of microcontroller-scale neural processing units (NPUs) is giving low-power mobile devices the ability to get in on the action without unacceptable latency or draining their batteries.

Whereas the GPU market is mature and the available options are generally well understood, NPUs are an emerging technology, leaving developers uncertain about which chips would best support their applications. Fortunately, a group of researchers at Imperial College London and the University of Cambridge has recently evaluated a number of commercially available NPUs. Their work lays out the pros and cons of these chips, and also provides a consistent framework for evaluating future NPUs as they become available.

The researchers conducted an in-depth performance analysis of multiple microcontroller-class AI platforms, examining not only end-to-end inference latency and power consumption, but also breaking down the inference process into granular stages. These stages include NPU initialization, memory I/O (input and output tensor movement), inference execution, post-processing, and idle power consumption.
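To make that methodology concrete, here is a minimal sketch of what stage-wise instrumentation might look like on a microcontroller. The timer and NPU functions are hypothetical placeholders standing in for a platform's HAL and NPU SDK; the researchers' actual benchmarking harness may look quite different.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical platform hooks -- substitute your board's timer HAL and
 * NPU SDK calls. These names are illustrative, not a real API. */
extern uint32_t timer_us(void);                      /* microsecond timestamp */
extern void npu_init(void);                          /* one-time NPU bring-up */
extern void npu_load_input(const int8_t *in, size_t len);
extern void npu_run_inference(void);
extern void npu_read_output(int8_t *out, size_t len);

/* Time each stage separately, mirroring the paper's breakdown:
 * initialization, input tensor I/O, execution, and output tensor I/O. */
void benchmark_stages(const int8_t *in, size_t in_len,
                      int8_t *out, size_t out_len)
{
    uint32_t t0 = timer_us();
    npu_init();                      /* stage 1: NPU initialization  */
    uint32_t t1 = timer_us();
    npu_load_input(in, in_len);      /* stage 2: input tensor I/O    */
    uint32_t t2 = timer_us();
    npu_run_inference();             /* stage 3: inference execution */
    uint32_t t3 = timer_us();
    npu_read_output(out, out_len);   /* stage 4: output tensor I/O   */
    uint32_t t4 = timer_us();

    printf("init %lu us | in I/O %lu us | infer %lu us | out I/O %lu us\n",
           (unsigned long)(t1 - t0), (unsigned long)(t2 - t1),
           (unsigned long)(t3 - t2), (unsigned long)(t4 - t3));
}
```

Separating the stages this way is what reveals, for example, whether a chip's bottleneck is the NPU itself or the cost of moving tensors in and out of it.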

Platforms tested include the MAX78000 (with both Cortex-M4 and RISC-V cores), HX-WE2 (Arm Cortex-M55 with Ethos-U55 NPU), GAP8, STM32H7A3ZI, ESP32-S3, NXP-MCXN947, and the MILK-V RISC-V SoC. Across all tests, each platform was evaluated using the same suite of AI models, including ResidualNet, SimpleNet, CIFAR10-NAS, YoloV1, and an autoencoder.

Among the notable findings, the MAX78000 with its Cortex-M4 CPU offered the best efficiency across all models when NPU initialization was included — consistently delivering sub-30ms latency while maintaining the lowest idle power draw (13.21 mW). However, this came with a caveat — the MAX78000's performance is heavily memory-bound, with memory I/O consuming up to 90% of total inference time.

The HX-WE2 platform demonstrated an average 1.93x latency speedup over the MAX78000, albeit at the cost of more than triple the power consumption. Meanwhile, general-purpose MCUs like the STM32H7A3ZI and ESP32-S3 showed significantly poorer efficiency, especially on complex models, reinforcing the value of dedicated neural hardware.
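Those two ratios are enough for a back-of-the-envelope energy comparison, since energy per inference is roughly average power multiplied by latency. In the snippet below, the absolute figures are placeholders for illustration; only the 1.93x speedup and the roughly 3x power ratio come from the study.

```c
#include <stdio.h>

/* Back-of-the-envelope energy comparison: energy per inference is
 * approximately average power x latency. Absolute numbers here are
 * placeholders; only the ratios (1.93x faster, ~3x the power) are
 * taken from the reported results. */
int main(void)
{
    double max78000_latency_ms = 30.0;   /* illustrative sub-30 ms figure */
    double max78000_power_mw   = 50.0;   /* placeholder active power      */

    double hxwe2_latency_ms = max78000_latency_ms / 1.93; /* 1.93x speedup */
    double hxwe2_power_mw   = max78000_power_mw * 3.0;    /* ~3x the power */

    double e_max = max78000_power_mw * max78000_latency_ms; /* microjoules */
    double e_hx  = hxwe2_power_mw * hxwe2_latency_ms;

    /* 3.0 / 1.93 is about 1.55: the faster chip still spends roughly
     * 55% more energy per inference under these assumptions. */
    printf("energy ratio (HX-WE2 / MAX78000): %.2f\n", e_hx / e_max);
    return 0;
}
```

In other words, the speedup does not buy back the extra power, which is why the right choice depends on whether your application is latency-bound or battery-bound.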

The study also highlighted memory I/O as a major performance bottleneck — particularly for the MAX78000 — and suggested optimization strategies, such as dynamic weight allocation and weight preloading, to reduce this overhead. Platforms like the NXP-MCXN947 exhibited negligible memory I/O latency (as low as 0.05 ms), showing promise for workloads requiring frequent model switching.
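As a rough illustration of the weight preloading idea, the sketch below keeps two models' weights resident in NPU-accessible memory so that switching models at inference time becomes a lookup rather than a bulk copy. Every name here is hypothetical; real SDKs, such as the MAX78000's, expose their own weight-loading routines.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Illustrative sketch only: all names and sizes are hypothetical. */

#define NPU_WEIGHT_SLOTS 2
#define NPU_SLOT_BYTES   (64 * 1024)   /* assumed size of one weight region */

/* Stand-in for NPU-accessible weight memory. */
static int8_t npu_weight_mem[NPU_WEIGHT_SLOTS][NPU_SLOT_BYTES];
static const int8_t *resident_model[NPU_WEIGHT_SLOTS];

/* Copy a model's weights into a slot during idle time, keeping the
 * transfer off the latency-critical inference path. */
void preload_weights(int slot, const int8_t *weights, size_t len)
{
    memcpy(npu_weight_mem[slot], weights, len);
    resident_model[slot] = weights;
}

/* At inference time, model switching is a pointer lookup, not a copy. */
int select_model(const int8_t *weights)
{
    for (int i = 0; i < NPU_WEIGHT_SLOTS; i++)
        if (resident_model[i] == weights)
            return i;    /* already resident: no per-inference weight I/O */
    return -1;           /* not loaded: caller must preload first */
}
```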

This work provides important insights into the trade-offs among low-power AI platforms and emphasizes that certain metrics can be poor predictors of real-world performance. To truly understand how a platform will work for your use case, more thorough benchmarking is necessary. As more AI workloads move to the edge, studies like this offer a much-needed roadmap for developers seeking the right balance of power, latency, and efficiency.

Nick Bild
R&D, creativity, and building the next big thing you never knew you wanted are my specialties.