The Beginnings of Small AI?

We have reached a tipping point when it comes to AI, with signs that it is getting smaller and moving to the edge. Say "hello" to Small AI.


We have arguably reached a tipping point when it comes to generative AI, and the only question that really remains is not whether these models will become common, but how we will see them used. While there are worrying outstanding problems with how they are viewed and how they are currently being used, I think we're now seeing some interesting signs that, like the machine learning models that came before them, generative AI is moving offline and to the edge. Repeating the process we saw with tinyML, we're seeing the beginnings of a Small AI movement.

We have spent more than a decade building large-scale infrastructure in the cloud to manage big data. We built silos, warehouses, and lakes. But over the last few years it has become, perhaps, somewhat evident that we may have made a mistake. The companies we trusted with our data, in exchange for our free services, have not been careful with it. However, in the last few years we've seen the arrival of hardware designed to run machine learning models at vastly increased speeds, and inside a relatively low power envelope, without needing a connection to the cloud. With it, edge computing, previously seen only as the domain of data collection rather than data processing, became a viable replacement for the big data architectures of the previous decade.

But just as we were beginning to think that the pendulum of computing history had taken yet another swing, away from centralised and back again to distributed architectures, the almost overly dramatic arrival of generative AI over the last two or three years changed everything. Yet again.

Because generative AI models need the cloud. They need the resources that the cloud can provide. Except of course, when they don't. Because it didn't take very long before people were running models like Meta's LLaMA locally.

Crucially, this new implementation of LLaMA used four-bit quantization. A technique for reducing the size of models so they can run on less powerful hardware, quantization has been widely used for models running on microcontroller hardware at the edge, but hadn't previously been applied to larger models like LLaMA. In this case it reduced the size of the model, and the computational power needed to run it, from cloud-sized proportions down to laptop-sized ones. It meant that you could run LLaMA on hardware no more powerful than a Raspberry Pi.
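To give a sense of what four-bit quantization actually does, here is a minimal sketch in Python. It quantizes weights block by block with a single scale per block, loosely in the spirit of the schemes llama.cpp uses; the block size, layout, and function names are illustrative assumptions rather than the real GGUF format.

```python
import numpy as np

def quantize_4bit(weights, block_size=32):
    """Quantize a 1-D float32 array to 4-bit integer codes with per-block scales."""
    blocks = weights.reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps to the int4 limit (7).
    scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0
    scales[scales == 0] = 1.0  # avoid dividing by zero for all-zero blocks
    codes = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)
    return codes, scales.astype(np.float32)

def dequantize_4bit(codes, scales):
    """Recover approximate float32 weights from the quantized representation."""
    return (codes.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(0)
w = rng.normal(0, 0.02, size=4096 * 32).astype(np.float32)

codes, scales = quantize_4bit(w)
w_hat = dequantize_4bit(codes, scales)

# 32 bits per weight drop to 4 bits (plus a small per-block scale), roughly an
# 8x saving. The codes are stored one per byte here purely for readability; a
# real implementation packs two 4-bit values into each byte.
print("max quantization error:", float(np.abs(w - w_hat).max()))
```

The trade-off is a small loss of precision in exchange for a model that fits in a fraction of the memory, which is exactly what moves inference from a rack of GPUs down to a laptop, or a Raspberry Pi.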

But unlike standard tinyML, where we're looking at models with an obvious purpose on the edge, models performing object detection or classification, vibration analysis, or other sensor-related tasks, generative AI doesn't have an obvious place at the edge. At least not beyond proving it could be done.

Except that the real promise of the Internet of Things wasn't novelty lightbulbs. It was the possibility that we could assume computation, that we could assume the presence of sensors around us, and that we could leverage that to do more. Not just to turn lightbulbs on, and then off again, with our phones.

I think the very idea that hardware is just “software wrapped in plastic” has done real harm to the way we have built smart devices. The way we talk to our hardware is an inherited artefact of how we write our software. The interfaces that our hardware presents look like the software underneath — just like software subroutines. We can tell our things to turn on or off, up or down. We send commands to our devices, not requests.

We have taken the lazy route and decided that hardware, physical things, are just like software but coated in plastic, and that isn't the case. We need to move away from the concept of smart devices as subroutines, and start imbuing them with agency. However, for the most part, the current generation of smart devices are just network-connected clients for machine learning algorithms running in the cloud in remote data centres.

But if there isn’t a network connection, because there isn’t a need to connect to the cloud, the attack surface of a smart device gets a lot smaller. Yet the main driver towards the edge, and towards using generative AI models there rather than in the cloud, is not really technical. It’s not about security. It’s moral and ethical.

We need to make sure that privacy is designed into our architectures. Privacy for users is easier to enforce if the architecture of your system doesn’t require data to be centralised in the first place, which is a lot easier if your decisions are made on the edge rather than in the cloud.

To do so we need to optimise LLMs to run in those environments, and we're starting to see some initial signs that this is a real consideration for people. The announcement that Google is going to deploy the Gemini Nano model to Android phones to provide real-time, offline scam call detection is a solid leading indicator that we may be moving in the right direction.

We're also seeing interesting architectures evolving where our existing tinyML models are used as triggers for more resource-intensive LLMs by using keyframe filtering. Here, instead of continuously feeding data to the LLM, the tinyML model is used to identify keyframes, critical data points showing significant change, which are then forwarded to the larger LLM. Prioritising these keyframes significantly reduces the number of tokens presented to the LLM, allowing it to be smaller and leaner, and to run on more resource-constrained hardware, as in the sketch below.
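Here is a minimal sketch of that keyframe-filtering pattern in Python. The "tinyML model" is stood in for by a simple change detector and the LLM by a stub that just builds a prompt; the function names and threshold are hypothetical, and in a real system the detector would be a trained model and the prompt would go to a locally running LLM.

```python
import numpy as np

def is_keyframe(previous, current, threshold=0.5):
    """Stand-in for a small on-device model: flag samples that show significant change."""
    return abs(float(current) - float(previous)) > threshold

def build_llm_prompt(keyframes):
    """Stub for handing only the filtered keyframes to a (local) LLM."""
    lines = [f"t={t}: {value:.2f}" for t, value in keyframes]
    return "Describe the significant events in these sensor readings:\n" + "\n".join(lines)

# Simulated sensor stream: mostly steady, with one abrupt event in the middle.
rng = np.random.default_rng(1)
stream = np.concatenate([rng.normal(0.0, 0.05, 50),
                         rng.normal(3.0, 0.05, 50),
                         rng.normal(0.0, 0.05, 50)])

keyframes = [(t, float(stream[t]))
             for t in range(1, len(stream))
             if is_keyframe(stream[t - 1], stream[t])]

# Only a handful of the 150 samples ever reach the LLM, which keeps the token
# count, and therefore the model size and hardware requirements, small.
print(f"{len(keyframes)} keyframes out of {len(stream)} samples")
print(build_llm_prompt(keyframes))
```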

However, despite the ongoing debate around what open source really means when it comes to machine learning models, I think the most optimistic sign that we're looking at a future where generative AI runs close to the edge, with everything that means for our privacy, is the fact that a lot of people want to do it. There are whole communities built around the idea that of course you should be running your LLM locally on your own hardware, and the popularity of projects like Ollama, GPT4All, and llama.cpp, amongst others, underscores the demand to do just that.
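Getting started is already surprisingly lightweight. As a short illustration, here is how a quantized model can be run entirely on-device using the llama-cpp-python bindings for llama.cpp; the model path is a placeholder for whatever 4-bit GGUF file you have already downloaded, and nothing here touches the network.

```python
from llama_cpp import Llama

# Load a locally stored, 4-bit quantized model. The file name is a placeholder;
# point it at whatever GGUF model you have on disk.
llm = Llama(
    model_path="./models/llama-3-8b-instruct.Q4_K_M.gguf",
    n_ctx=2048,  # context window; smaller values need less memory
)

response = llm(
    "Q: Why might you want to run a language model on-device rather than in the cloud? A:",
    max_tokens=128,
    stop=["Q:"],
)

print(response["choices"][0]["text"])
```

Command-line tools like Ollama make the same kind of local-first workflow a one-line affair, which is a large part of why running models on your own hardware is starting to feel routine rather than exotic.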

If we want to walk an ethical path forward, towards the edge of tomorrow, one that provides a more intuitive and natural interface for real-world interactions, then we need to take the path without the ethical and privacy implications that running our models centrally would imply. We need Small AI. We need "open source" models, not another debate around what open source means, and we need tooling and documentation that makes running those models locally easier than doing it in the cloud.

aallan

Scientist, author, hacker, maker, and journalist. Building, breaking, and writing. For hire. You can reach me at 📫 alasdair@babilim.co.uk.
