MATCH Abstracts Away Hardware Differences to Deliver Better TinyML on a Range of Microcontrollers
Clever abstraction delivers major performance gains, without the complexity of starting from scratch for each new target device.
Researchers from Italy's Politecnico di Torino, KU Leuven in Belgium, IMEC, and the University of Bologna have come up with a way to boost the performance of deep neural networks (DNNs) running on microcontrollers, without having to rebuild the compilation toolchain from scratch for each target platform.
"Streamlining the deployment of Deep Neural Networks (DNNs) on heterogeneous edge platforms, coupling within the same microcontroller unit (MCU) instruction processors and hardware accelerators for tensor computations, is becoming one of the crucial challenges of the tinyML field," the team explains of the problem it set out to solve. "The best-performing DNN compilation toolchains are usually deeply customized for a single MCU family, and porting to a different heterogeneous MCU family implies labor-intensive re-development of almost the entire compiler. On the opposite side, retargetable toolchains, such as [Apache] TVM, fail to exploit the capabilities of custom accelerators, resulting in the generation of general but unoptimized code."
The solution proposed by the team: MATCH, Model-Aware TVM-based Compilation for Heterogeneous Edge Devices. This, the researchers explain, is a deployment framework for DNNs that allows rapid retargeting across different microcontrollers and accelerators at greatly reduced effort, by capturing each device's capabilities in a model-based hardware abstraction layer rather than hard-coding them into the compiler.
"Starting from a Python-level DNN model," the researchers explain, "MATCH generates optimized HW [hardware]-specific C code to deploy the DNN on OS-less heterogeneous devices. To extend MATCH to a new HW target, we provide the MatchTarget class, which can encompass one or more HW Execution Modules. Each HW Execution Module contains four key components: Pattern Table, [which] lists the supported patterns for the module; the Cost Model [which] is used for generating the correct schedule for each supported operator pattern; a set of Network Transformations […] to be applied to the neural network both before and after graph partitioning; [and] finally a Code Generation Backend."
To prove the potential of the system, the researchers tested it out on two microcontroller platforms: GreenWaves' Internet of Things-oriented GAP9 and the DIgital-ANAlog (DIANA) artificial intelligence processor, both based on the free and open RISC-V instruction set architecture. Using the MLPerf Tiny benchmark suite, MATCH delivered a 60-fold latency improvement on DIANA compared to using Apache TVM alone, and a 16.94 per cent latency improvement over HTVM, a toolchain customized specifically for DIANA; on GAP9, it delivered a twofold improvement over the dedicated DORY compiler.
"Differently from other target-specific toolchains, MATCH does not embed hardware-dependent optimizations or heuristics in the code but rather exposes an API [Application Programming Interface] to define high-level model-based hardware abstractions, fed to a generic and flexible optimization engine," the team claims. "As a consequence, adding support for a new HW module becomes significantly easier, avoiding complex optimization pass re-implementations. A new HW target can be added in less than one week of work."
A preprint detailing MATCH is available on Cornell's arXiv server, under open access terms.