A Flashy Way to Run LLMs
Using an approach that efficiently swaps parameters between RAM and flash memory, edge computing devices can run powerful LLMs on-device.
Large language models (LLMs) have burst onto the scene in a big way in recent years, garnering enormous interest for their impressive performance on a wide range of natural language tasks. Perhaps the only aspect of LLMs discussed as much as their capabilities is their massive size and the tremendous computational resources required to run them effectively.
When notable models like OpenAI’s GPT-4 were released, it was widely reported that many of them had a staggering number of parameters, often well over a trillion. That put local execution of these models far out of reach for all but large, well-funded organizations. Since that time, many algorithmic advances have been made, with the open-source community leading the way. Thanks to those efforts, much smaller models, often containing fewer than ten billion parameters, have achieved levels of performance that rival their much larger counterparts in many respects.
This dramatic reduction in model size has gone a long way toward democratizing the use of LLMs, to be sure. But now that we have arrived at this point, the natural next step is to run these models on smaller compute platforms, moving from powerful workstations to more energy-efficient edge devices. Unfortunately, this is still a bit out of reach. Even a model with seven billion parameters stored in half-precision floating point format requires 14 GB of memory just to hold its weights.
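To see where that figure comes from, each parameter in half precision occupies two bytes, so the footprint is simply the parameter count times two. A tiny, purely illustrative helper makes the arithmetic explicit:

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed just to hold a model's weights.

    bytes_per_param is 2 for half-precision (fp16 or bf16) weights.
    """
    return num_params * bytes_per_param / 1e9

# A seven-billion-parameter model stored in half precision:
print(weight_memory_gb(7e9))  # ~14.0 GB, before counting activations or the KV cache
```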
In the edge computing world, that is a lot of memory. So unless developers can significantly shrink models that have already been squeezed thin, new approaches are needed to run them on resource-constrained hardware. One such approach was recently unveiled by a team of engineers at Apple. Recognizing that model sizes will likely always be a few steps ahead of what edge devices can handle, they developed a technique that keeps the full set of model parameters in flash memory and loads into main memory only the parameters that are immediately needed, pulling in additional parameters from flash as inference proceeds.
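At its simplest, the idea is a form of lazy loading. The sketch below keeps a layer's weights memory-mapped on flash and copies them into DRAM only when the layer is about to run; the class, file layout, and method names are illustrative assumptions, not Apple's code:

```python
import numpy as np

class FlashBackedLayer:
    """Keep a layer's weights on flash via a memory map and copy them into
    DRAM only when the layer is actually about to run."""

    def __init__(self, weight_file: str, shape: tuple, dtype=np.float16):
        # np.memmap leaves the data on flash until it is explicitly read.
        self.flash_weights = np.memmap(weight_file, dtype=dtype, mode="r", shape=shape)
        self.ram_weights = None  # nothing resident in DRAM yet

    def forward(self, x: np.ndarray) -> np.ndarray:
        if self.ram_weights is None:
            # Pull the weights from flash into DRAM on first use.
            self.ram_weights = np.array(self.flash_weights)
        return x @ self.ram_weights

    def evict(self) -> None:
        # Release the DRAM copy once the layer is no longer needed.
        self.ram_weights = None
```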
You may be thinking that this does not sound especially innovative. After all, almost since the advent of permanent storage, it has been used to swap data in and out of main memory to make the most of that limited resource. But the novelty lies less in the fact that parameters are swapped between main memory and flash than in how the team did it.
To maintain acceptable performance, the team focused on two primary factors: minimizing the overall volume of data transferred, and structuring the transfers to play to the strengths of flash memory. The first goal was addressed with a technique they call “windowing,” which loads parameters only for the past few tokens and reuses activations from recently computed tokens. This sets up a sliding window of data transfers that reduces the number of I/O requests. For the second, the team used a row-column bundling method when requesting data from flash memory. By storing a concatenated row and column of the up-projection and down-projection layers together, it is possible to read larger, contiguous blocks. Reading from flash memory in this way increases throughput.
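The sketch below is one way to picture how windowing and row-column bundling could work together; it is a toy illustration rather than Apple's implementation. It assumes a predictor (omitted here) supplies the set of feed-forward neurons active for each token, and that each neuron's up-projection and down-projection weights have been written to a single bundle file. All of the names, shapes, and the file layout are assumptions.

```python
from collections import deque
import numpy as np

class WindowedFFNCache:
    """Toy illustration of windowing plus row-column bundling.

    Windowing: keep in DRAM only the feed-forward neurons that were active
    for the last `window` tokens, so each new token triggers flash reads
    only for neurons that are not already resident.

    Bundling: each neuron's up-projection column and down-projection row are
    stored contiguously in one bundle, so fetching a neuron is a single
    sequential read from flash instead of two scattered ones.
    """

    def __init__(self, bundle_file: str, d_model: int, d_ffn: int, window: int = 5):
        # One bundle of length 2 * d_model per neuron, assumed to be laid out
        # contiguously in bundle_file (hypothetical file, fp16 weights).
        self.bundles = np.memmap(bundle_file, dtype=np.float16,
                                 mode="r", shape=(d_ffn, 2 * d_model))
        self.recent = deque(maxlen=window)  # active-neuron sets of recent tokens
        self.resident = {}                  # neuron id -> bundle copied into DRAM

    def step(self, active_neurons: set[int]) -> None:
        """Update the DRAM cache for the current token's active neurons."""
        # Load only the neurons that are not already resident in DRAM.
        for n in active_neurons - self.resident.keys():
            self.resident[n] = np.array(self.bundles[n])  # one contiguous read

        # Slide the window forward and evict neurons no recent token used.
        self.recent.append(active_neurons)
        still_needed = set().union(*self.recent)
        for n in list(self.resident):
            if n not in still_needed:
                del self.resident[n]
```

The key point is that eviction lags a few tokens behind loading, so neurons that stay active across neighboring tokens are read from flash only once, while the bundled layout keeps each of those reads large and sequential.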
Using these methods, a system can efficiently run a model up to twice the size of its available memory. Inference runs up to five times faster than naively swapping data between memory and flash on a CPU, and up to 25 times faster on a GPU. The team hopes that their work will help LLMs reach their full potential across a wide range of devices and applications.