Just Take a Little Off the Top

TinyML models deployed on low-power MCUs with DTMM's filterlet pruning outperform existing methods, balancing compression and accuracy.

Overview of the DTMM approach (📷: L. Han et al.)

TinyML models have proven adaptable and versatile, finding numerous applications across a range of industries. In the industrial sector, for instance, these models have become highly valuable for predictive maintenance of machinery. By deploying tinyML models on hardware platforms built around low-power microcontrollers, companies can continuously monitor equipment health, proactively schedule maintenance, and detect potential failures early, reducing both downtime and operational costs. The low cost and ultra-low power consumption of these platforms make them ideal for widespread deployments. Furthermore, tinyML models analyze data directly on the device, delivering real-time insights while preserving privacy.

However, while on-device processing offers clear benefits, the severe resource limitations of low-power microcontrollers present substantial challenges. Model pruning has emerged as a promising solution, shrinking models to fit within the constrained memory of these devices. A dilemma arises, though: deep compression boosts speed, but it tends to erode accuracy. Current approaches often prioritize one goal over the other rather than striking a balanced compromise.

A trio of engineers at the City University of Hong Kong is seeking a better balance between inference speed and model accuracy with a new library they have developed called DTMM. The library plugs into TensorFlow Lite for Microcontrollers, the popular open-source toolkit for designing and deploying machine learning models on microcontrollers, and takes an innovative approach to pruning that allows it to produce models that are simultaneously highly compressed and accurate.

Different approaches to model pruning (📷: L. Han et al.)

Existing systems, like TensorFlow Lite for Microcontrollers, use a strategy called structured pruning that removes entire filters from a model to reduce its size. While this method is easy to implement, it can discard many useful weights, hurting accuracy when high compression is needed. For this reason, another technique, called unstructured pruning, has been developed. This method targets individual weights rather than entire filters, preserving accuracy by removing only the least important weights. However, it incurs extra storage costs to index the scattered surviving weights and sits awkwardly with existing machine learning frameworks, making inference slower.
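As a rough illustration of the difference (not code from DTMM or TensorFlow Lite for Microcontrollers), the NumPy sketch below prunes a toy convolutional layer both ways; the layer shape, magnitude-based scoring, and 50% keep ratio are made-up values for demonstration.

```python
import numpy as np

# Toy convolutional layer: 8 filters, 4 input channels, 3x3 kernels.
weights = np.random.randn(8, 4, 3, 3).astype(np.float32)

def structured_prune(w, keep_ratio=0.5):
    """Drop whole filters with the smallest L1 norms (structured pruning)."""
    scores = np.abs(w).sum(axis=(1, 2, 3))            # one importance score per filter
    keep = np.sort(np.argsort(scores)[-int(len(scores) * keep_ratio):])
    return w[keep]                                    # result is a smaller dense tensor

def unstructured_prune(w, keep_ratio=0.5):
    """Zero out individual low-magnitude weights (unstructured pruning)."""
    threshold = np.quantile(np.abs(w), 1.0 - keep_ratio)
    return w * (np.abs(w) >= threshold)               # same shape, irregular zeros

print(structured_prune(weights).shape)                # (4, 4, 3, 3): framework-friendly
print(np.count_nonzero(unstructured_prune(weights)))  # roughly half the weights survive
```

The structured result is simply a smaller dense tensor that any framework can run, while the unstructured result keeps its original shape and needs extra bookkeeping at inference time to index and skip the zeros.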

With both speed and storage space in short supply on tiny computing platforms, this approach is often unworkable on these devices. DTMM, on the other hand, leverages a new technique that the team calls filterlet pruning. Instead of removing entire filters or individual weights, DTMM introduces a new pruning unit called a "filterlet": a group of weights in the same position across all channels of a filter. The approach builds on the observation that the weights in each filterlet are stored contiguously on the microcontroller, which allows for more compact storage and faster model inference.
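The sketch below, again in NumPy and only a conceptual approximation rather than DTMM's actual implementation, shows what grouping and scoring weights by filterlet could look like; the memory layout, L2-norm scoring, and keep ratio are illustrative assumptions.

```python
import numpy as np

# Toy filter bank laid out so that, for each filter and kernel position,
# the weights across all input channels sit next to each other in memory.
# That contiguous run of C weights is one "filterlet".
weights = np.random.randn(8, 3, 3, 4).astype(np.float32)   # (filters, kh, kw, channels)
O, H, W, C = weights.shape

filterlets = weights.reshape(O * H * W, C)    # one row per filterlet

# Score each filterlet as a unit and drop the weakest groups, so pruning
# removes contiguous runs of weights instead of scattered single values.
keep_ratio = 0.5
scores = np.linalg.norm(filterlets, axis=1)
mask = scores >= np.quantile(scores, 1.0 - keep_ratio)

pruned = filterlets[mask]                     # surviving filterlets stay densely packed
kept_index = np.flatnonzero(mask)             # one index per kept group of C weights

print(pruned.shape, kept_index.size)          # e.g. (36, 4) filterlets kept out of 72
```

Because each stored index covers a whole group of C weights rather than a single value, the bookkeeping overhead stays well below that of fully unstructured pruning, while the granularity remains much finer than removing entire filters.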

Weights across filter channels are stored contiguously (📷: L. Han et al.)

To evaluate their system, the researchers benchmarked DTMM against a pair of existing, state-of-the-art pruning methods, namely CHIP and PatDNN. The comparison considered model size, execution latency, runtime memory consumption, and accuracy after pruning. DTMM outperformed both CHIP and PatDNN on model size reduction, with improvements of 39.53% and 11.92% on average, respectively. It also led on latency, beating CHIP and PatDNN by an average of 1.09% and 68.70%, respectively. All three methods satisfied runtime memory constraints, though PatDNN struggled in some cases due to high indexing overhead. DTMM also delivered higher accuracy for pruned models, holding up better as model size shrank. The analysis revealed that DTMM could selectively prune weights from each layer, removing 37.5-99.0% of the weights across layers, and that its structure design effectively minimized indexing and storage overhead.

The remarkable gains seen when compared to state-of-the-art methods show that DTMM could have a bright future in the world of tinyML.
