Cutting Down Cutting-Edge AI
CALDERA, a new algorithm by Stanford and Princeton, compresses LLMs like Llama 3 for edge computing by reducing redundancies and precision.
The ability of large language models (LLMs) to generate human-like text has made them valuable tools for applications ranging from text summarization to translation and even code generation. But the potential benefits of LLMs are not yet being fully realized, largely because these algorithms (the best-performing of them, anyway) require massive computational resources to run. As such, they must run on powerful computer systems in remote data centers, which is not ideal for many use cases.
Sending data over public networks raises many privacy concerns, and the latency this approach introduces keeps LLMs out of real-time applications. If these algorithms could run on edge computing hardware, a whole new world of possibilities would open up. That is easier said than done, of course. An algorithm that requires a huge cluster of GPUs and an enormous amount of memory cannot simply be loaded onto a low-power edge system with limited resources.
Cut it out!
We may be one step closer to making that vision a reality, however, thanks to the work of a team of researchers at Stanford University and Princeton University. They have developed a novel algorithm called Calibration Aware Low precision DEcomposition with low Rank Adaptation (CALDERA) that can slice and dice an LLM to significantly reduce its computational complexity without significantly degrading its performance.
This is possible because while LLMs are trained to have a deep understanding of natural language, the training process is not always all that efficient. There is a lot of redundancy and otherwise unnecessary information that gets encoded into the weight matrices that power these models. CALDERA looks for these inefficiencies and carves them out to shrink the model down to a more reasonable size, while minimizing any negative impacts on algorithm accuracy.
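To get a feel for what that redundancy looks like, the short NumPy sketch below (a toy illustration, not CALDERA's actual code; the matrix sizes and rank are arbitrary) builds a weight matrix from only a handful of independent directions and then inspects its singular values, which show that nearly all of the matrix's information can be captured by a much smaller representation:

```python
import numpy as np

# Toy "weight matrix" built from only a few independent directions,
# mimicking the kind of redundancy that can arise in trained weights.
rng = np.random.default_rng(0)
true_rank = 8
W = rng.standard_normal((512, true_rank)) @ rng.standard_normal((true_rank, 512))
W += 0.01 * rng.standard_normal((512, 512))  # a little noise on top

# The singular values expose the redundancy: only a handful are large,
# so a rank-8 approximation captures nearly all of the matrix.
singular_values = np.linalg.svd(W, compute_uv=False)
energy = np.cumsum(singular_values**2) / np.sum(singular_values**2)
print(f"Share of energy in the top {true_rank} components: {energy[true_rank - 1]:.4f}")
```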
The researchers took a two-pronged approach in developing CALDERA. The tool seeks to reduce both the precision and the rank of the original model. In plainer terms, this means that each model weight is stored using fewer bits, and that redundancies in the weight matrices are sought out and eliminated. The combination of these optimizations allows for far greater model compression than either can provide on its own.
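As a rough illustration of how the two prongs fit together, here is a minimal NumPy sketch (with assumed sizes, rank, and bit width; it is not the actual CALDERA algorithm, which also uses calibration data and a more careful optimization) that keeps a small pair of low-rank factors and aggressively quantizes whatever they miss:

```python
import numpy as np

def quantize(mat, bits=2):
    """Uniform scalar quantizer to 2**bits levels (illustrative only)."""
    levels = 2 ** bits
    lo, hi = mat.min(), mat.max()
    step = (hi - lo) / (levels - 1)
    return np.round((mat - lo) / step) * step + lo

rng = np.random.default_rng(0)
# Stand-in weight matrix with strong low-rank structure plus noise.
W = rng.standard_normal((512, 16)) @ rng.standard_normal((16, 512))
W += 0.1 * rng.standard_normal((512, 512))

# Prong 1 (low rank): keep only the top singular-value components as
# two small factor matrices L and R.
U, s, Vt = np.linalg.svd(W, full_matrices=False)
k = 16
L, R = U[:, :k] * s[:k], Vt[:k, :]

# Prong 2 (low precision): aggressively quantize what the factors miss.
Q = quantize(W - L @ R, bits=2)

W_hat = L @ R + Q  # compressed approximation of the original weights
rel_err = np.linalg.norm(W - W_hat) / np.linalg.norm(W)
print(f"Relative approximation error: {rel_err:.3f}")
```

Storing the two skinny factor matrices plus a few bits per entry of the residual takes far less space than the original full-precision matrix, which is the essence of the compression.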
The future is tiny
In experiments, the researchers applied CALDERA to Meta's popular Llama 2 and Llama 3 LLMs and achieved significant model compression while largely maintaining the models' performance. These results hint that a future in which everything from laptops to smartphones runs cutting-edge LLMs could be right around the corner. But before we fully arrive at that future, more work is necessary. Perhaps other researchers will combine this work with additional optimizations to amplify the effect of CALDERA.