Every Little Bit Counts
Tiled Bit Networks compress parameters to sub-bit levels using reusable binary tiles, reducing hardware demands and cutting peak memory use by as much as 2.8 times.
Advances in computing technologies — and in particular, in specialized hardware like GPUs and TPUs — have led to a boom in the development of deep neural networks. These powerful algorithms are the key to many successful applications in artificial intelligence, such as large language models and text-to-image generators. However, when it comes to deep neural networks, large never seems to be quite large enough. These models may contain many billions, or even more than a trillion, parameters. At some point, even the latest and greatest in hardware will be brought to its knees by the continued expansion of these networks.
Yet many research papers have suggested that very large models are in fact crucial for deep learning, so trimming them back to fit within the bounds of available hardware resources is likely to stymie technological progress. A team headed up by researchers at Colorado State University has proposed a new type of deep learning architecture that allows more parameters to be handled by less hardware, opening the door to larger deep learning models without requiring corresponding advances in hardware.
Their proposed architecture, called Tiled Bit Networks (TBNs), builds on an existing network architecture called Binary Neural Networks (BNNs). Whereas traditional neural networks encode their parameters as float or integer values that each require multiple bytes of storage, along with sufficient processing power to perform mathematical operations on them, BNN parameters are strictly binary. This reduces the storage of a parameter to a single bit and allows for quick, efficient mathematical operations during both training and inference.
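As a rough sketch of the underlying idea (a common scheme from the BNN literature, not necessarily the exact formulation used in this work), a layer's full-precision weights can be reduced to their signs plus a single scaling factor:

```python
import numpy as np

# Minimal sketch of BNN-style weight binarization (a common scheme;
# the exact formulation in this work may differ).
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 8))          # full-precision weights for one layer

alpha = np.abs(W).mean()             # one per-layer scaling factor
B = np.where(W >= 0, 1.0, -1.0)      # binary weights in {-1, +1}, 1 bit each to store

W_approx = alpha * B                 # binarized layer used in place of W
print("mean absolute error:", np.abs(W - W_approx).mean())
```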
This leads to big improvements in terms of processing speed and utilization of hardware resources, but remember — large is never large enough in deep learning. Even with these optimizations, networks are still blowing past the limits of what is possible with modern computing systems. That is where TBNs come in.
The tiling process in TBNs is a method for compressing neural network weights by learning and reusing compact binary sequences, or “tiles,” across the network’s layers. During training, rather than learning a unique set of binary weights for each parameter, TBNs learn a small set of binary tiles that can be strategically placed to reconstruct the weights in each layer. The process begins by identifying a small, reusable set of binary patterns that approximate the model's original weight values. These binary tiles are then used to fill in the network layers via tensor reshaping and aggregation, which means reorganizing the model's weight structure to allow the tiles to cover the required dimensions effectively.
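One way to picture that reconstruction step is sketched below. This is a simplified illustration, not the paper's exact procedure: a single small binary tile is repeated, trimmed, and reshaped until it covers the layer's full weight tensor, then scaled.

```python
import numpy as np

# Simplified sketch of tile-based weight reconstruction.
# Assumption: one learned binary tile per layer, reused to cover the whole layer.
def build_layer_weights(tile, out_shape, scale):
    """Repeat a 1-D binary tile, reshape it to the layer's shape, and scale it."""
    n_weights = int(np.prod(out_shape))
    reps = -(-n_weights // tile.size)          # ceiling division
    flat = np.tile(tile, reps)[:n_weights]     # reuse the tile until the layer is covered
    return scale * flat.reshape(out_shape)

rng = np.random.default_rng(0)
tile = rng.choice([-1.0, 1.0], size=128)       # hypothetical 128-bit tile
W = build_layer_weights(tile, out_shape=(64, 64), scale=0.05)
print(W.shape)                                  # (64, 64): 4,096 weights from one 128-bit tile
```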
Each tile, in combination with a scalar factor applied per layer or per tile, provides sufficient representation for the layer’s functionality, achieving a high compression rate by minimizing redundancy. The use of tiles enables TBNs to drastically reduce memory requirements, as only a single tile per layer needs to be stored, and this tile can be reused during both training and inference to approximate the full set of weights. This approach is highly efficient because it leverages the inherent patterns within weight matrices, thereby reducing the need for unique binary values across the entire network. The result is a deep neural network that requires less than one bit per parameter, significantly cutting computational costs.
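To see how storage drops below one bit per parameter, consider a back-of-the-envelope calculation (the layer and tile sizes here are hypothetical, not figures from the paper):

```python
# Illustrative storage math with hypothetical sizes (not figures from the paper).
layer_params = 64 * 64               # 4,096 weights in the layer
tile_bits = 128                      # one 128-bit binary tile stored for the layer
scale_bits = 32                      # one full-precision scaling factor

bits_per_param = (tile_bits + scale_bits) / layer_params
print(f"{bits_per_param:.3f} bits per parameter")   # ~0.039, well below 1 bit
```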
On two-dimensional and three-dimensional image classification tasks, TBNs performed comparably to BNNs while requiring far less parameter storage. Additionally, TBNs showed strong results in semantic and part segmentation, as well as time series forecasting.
To demonstrate what a practical model deployment looks like, TBNs were implemented on both microcontrollers and GPUs. On microcontrollers, TBNs significantly reduced memory and storage consumption, while on GPUs, they achieved a 2.8x reduction in peak memory usage compared to standard kernels in applications like vision transformers. These results highlight the effectiveness of TBNs in compressing neural networks without sacrificing performance.