Yeo Kheng Meng Laughs at Energy-Hungry LLMs, Gets Llama 2 Running Locally on an Intel 486 and MS-DOS
By using what are, paradoxically, some of the smallest large language models around, Meng has brought on-device "AI" to vintage systems.
Maker and developer Yeo Kheng Meng has jumped on the large language model bandwagon with a twist: rather than throwing the latest and greatest in high-performance, energy-hungry hardware at the problem, he's running the Llama 2 model on vintage PCs under Microsoft's MS-DOS.
"Ever thought of running a local large language model (LLM) on a vintage PC running DOS," Meng asks, rhetorically, in support of the project. "Now you can! Two years ago I wrote a DOS [OpenAI] ChatGPT client and many retrocomputing enthusiasts in the community wrote similar clients for other vintage platforms. They all had a similar issue, dependency on some remote service. What about running things locally? Conventional wisdom states that running LLMs locally will require computers with high performance specifications especially GPUs with lots of VRAM. But is this actually true?"
Large language models are the technology underpinning the current artificial intelligence (AI) boom: statistical models, trained on a vast corpus of often illegitimately-sourced material, that accept user input, break it down into "tokens," and return the most statistically-likely tokens in response, forming the shape of an answer though not always an actual answer itself. It's a process that demands vast computational resources, both for the initial training and for point-of-use inference, though if you make a model small enough, distil it down still further, and quantize it appropriately, it's possible to run basic LLMs on surprisingly lightweight systems.
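To make the "most statistically-likely token" idea concrete, here is a minimal illustration, not taken from the article or from llama2.c: a model produces one score (a "logit") per vocabulary entry for the next position, and the simplest (greedy) strategy just picks the highest-scoring entry. The toy vocabulary and scores below are invented purely to show the mechanism.

```c
/* Greedy next-token selection: pick the vocabulary entry with the
 * highest model score. Toy data only; real vocabularies hold tens of
 * thousands of entries and the scores come from the model itself. */
#include <stdio.h>

static int argmax(const float *logits, int n)
{
    int best = 0;
    for (int i = 1; i < n; i++)
        if (logits[i] > logits[best]) best = i;
    return best;
}

int main(void)
{
    const char *vocab[] = { "the", "cat", "sat", "mat" }; /* toy vocabulary */
    float logits[]      = { 0.1f, 2.3f, 0.7f, 1.9f };     /* toy model scores */
    printf("next token: %s\n", vocab[argmax(logits, 4)]); /* prints "cat" */
    return 0;
}
```

In practice inference engines usually sample from the score distribution (with a temperature or top-p cutoff) rather than always taking the single best token, which is what keeps the output from repeating itself.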
That's the trick to Meng's project. It relies on Andrej Karpathy's llama2.c, a single-file inference engine, targeting FP32 precision, for Meta's freely-distributed Llama 2 model architecture. "For ease of testing of smaller models, Karpathy trained several small models on the TinyStories dataset that only have sizes of 260k, 15M, 42M and 110M [parameters]," Meng explains. "This is to enable some basic level of LLM functionality on resource-constrain[ed] systems. llama2.c, although written [with] portability in mind, still has some challenges when it comes to making the codebase work for vintage systems."
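Those TinyStories checkpoints are plain binary files: in llama2.c's legacy format, a short header of seven integers (the Config struct in Karpathy's source) is followed by the FP32 weights. The sketch below, which assumes the published stories15M.bin checkpoint sits in the working directory, simply reads that header to show what one of these small models looks like on disk.

```c
/* Hedged sketch: read the Config header that precedes the FP32 weights
 * in a llama2.c-format checkpoint. Field names follow Karpathy's source;
 * the file name is one of the published TinyStories checkpoints. */
#include <stdio.h>

typedef struct {
    int dim;        /* transformer embedding dimension */
    int hidden_dim; /* feed-forward hidden dimension */
    int n_layers;   /* number of transformer layers */
    int n_heads;    /* number of attention heads */
    int n_kv_heads; /* number of key/value heads */
    int vocab_size; /* tokenizer vocabulary size */
    int seq_len;    /* maximum sequence length */
} Config;

int main(void)
{
    FILE *f = fopen("stories15M.bin", "rb");
    if (!f) { perror("stories15M.bin"); return 1; }

    Config c;
    if (fread(&c, sizeof c, 1, f) != 1) { fclose(f); return 1; }
    printf("layers=%d heads=%d dim=%d vocab=%d seq_len=%d\n",
           c.n_layers, c.n_heads, c.dim, c.vocab_size, c.seq_len);

    fclose(f);
    return 0;
}
```

At FP32, the 15M-parameter checkpoint works out to roughly 60MB of weights (15 million parameters at four bytes apiece), which is what makes models of this size plausible candidates for hardware of the 486 era, at least at the smaller end of the range.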
Using the Open Watcom 2 compiler and the DOS/32A DOS extender, Meng was able to work through the changes required to get llama2.c running on vintage systems. While his port of the inference engine is capable of putting more compact LLMs on vintage hardware, you'll have to be patient to actually use them: running under MS-DOS 6.22 on an Intel 486 desktop, the 260k model put out just 2.08 tokens per second, while jumping to a Toshiba Satellite 315CDT with a 200MHz Pentium MMX and three times the RAM boosted that to 15.32 tokens per second, or 0.43 tokens per second on the larger 15M-parameter model.
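Meng's write-up covers the specific modifications he made; as one illustrative example of the kind of change such a port needs (a sketch under that assumption, not Meng's actual patch), stock llama2.c memory-maps the checkpoint with the POSIX mmap() call, which DOS does not provide, so the weights have to be read into an ordinary heap buffer instead. The checkpoint file name and the helper below are hypothetical.

```c
/* Illustrative sketch only: replace mmap()-based checkpoint loading,
 * unavailable under DOS, with a plain fread() into a heap allocation.
 * header_bytes is the size of the Config header before the FP32 weights. */
#include <stdio.h>
#include <stdlib.h>

float *load_weights(const char *path, long header_bytes, size_t *n_floats)
{
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;

    fseek(f, 0, SEEK_END);
    long file_size = ftell(f);
    fseek(f, header_bytes, SEEK_SET);

    size_t count = (size_t)(file_size - header_bytes) / sizeof(float);
    float *weights = malloc(count * sizeof(float)); /* instead of mmap() */
    if (!weights || fread(weights, sizeof(float), count, f) != count) {
        free(weights);
        fclose(f);
        return NULL;
    }

    fclose(f);
    *n_floats = count;
    return weights;
}

int main(void)
{
    size_t n = 0;
    float *w = load_weights("stories260K.bin", 7 * (long)sizeof(int), &n);
    if (!w) { fprintf(stderr, "failed to load weights\n"); return 1; }
    printf("loaded %lu floats\n", (unsigned long)n);
    free(w);
    return 0;
}
```

Reading the whole file up front also keeps the memory footprint predictable, which matters on machines where total RAM is measured in single-digit megabytes.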
A modern desktop built around an AMD Ryzen 5 7600, by contrast, manages 927.27 tokens per second on the same 260k-parameter model running the same code, or a great deal more with a modern inference engine able to exploit the CPU's modern instruction set extensions, before even considering the speed-up available from moving to massively parallel execution on a graphics processor or dedicated accelerator.
Meng's full write-up is available on his website, while the source code and ready-to-run executable are available on GitHub under the permissive MIT license.