What Have You Done for Me, Latency?

IBM Research's NorthPole chip blurs the line between processing and memory to speed up AI algorithms while reducing energy consumption.

The NorthPole chip merges processing and memory for fast AI processing (📷: IBM Research)

In recent years, the surge in interest in Artificial Intelligence (AI) has been remarkable, forever changing the way businesses and industries operate. What was once considered an experimental technology has rapidly evolved into an integral component of enterprise-scale operations across various sectors, including finance, healthcare, manufacturing, and more. Organizations are increasingly leveraging AI to enhance decision-making processes, optimize operations, and create personalized experiences for their customers. This transition to mainstream adoption has been fueled by advancements in computing power, algorithmic breakthroughs, and the availability of vast datasets, enabling AI to tackle complex problems with newfound efficiency and accuracy.

However, modern computing technologies, which rely on the von Neumann architecture, were not originally designed to handle the intricacies of AI algorithms. This traditional architecture, with its separate processing and memory systems, creates what is known as the von Neumann bottleneck: the limited bandwidth between the central processing unit and memory forces frequent, slow data fetch operations, introducing significant processing latency.

As AI applications demand increasingly large amounts of data for processing, the inefficiencies caused by this bottleneck will begin to severely hamper performance. This issue will become more pronounced with the escalating complexity of AI models and the growing size of datasets. Despite the advancements in processing power and even the emergence of specialized hardware for AI, present limitations continue to impede the full realization of AI's potential.

The NorthPole chip integrated into a PCIe card (📷: IBM Research)

Without significant breakthroughs in hardware architecture that can mitigate these limitations, the pace of progress in AI will inevitably slow, hindering the adoption of new applications and restricting the technology's ability to deliver its full range of benefits. Engineers at IBM Research have put forth a possible new path forward with their NorthPole chip architecture. Inspired by the function of the brain, the team designed a novel type of chip that blurs the lines between processing and memory. NorthPole was demonstrated to be substantially faster, and more energy-efficient, than traditional computing architectures when running AI algorithms.

NorthPole, fabricated with a 12-nanometer process, interweaves both processing and memory units on a single chip. In the space of 800 square millimeters, the chip packs in 22 billion transistors that make up its 256 cores. At an 8-bit level of precision, NorthPole can perform 2,048 operations per core during each clock cycle. If less precision is required, the number of operations per cycle can be doubled or quadrupled by specifying a 4- or 2-bit precision level, respectively. This unique design was found to outperform all major existing architectures, even GPUs manufactured with a more advanced 4-nanometer process. Importantly, these performance gains were accompanied by major reductions in energy consumption.
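The precision-throughput trade-off described above lends itself to some back-of-the-envelope arithmetic. The following sketch (a simplified calculation based only on the figures reported here, not on any IBM tooling) shows how the chip's peak operations per clock cycle scale as precision drops from 8 bits to 4 or 2:

```python
# Back-of-the-envelope peak throughput for NorthPole, using the
# figures above: 256 cores, each performing 2,048 operations per
# cycle at 8-bit precision, with throughput doubling at 4-bit
# precision and quadrupling at 2-bit precision.
CORES = 256
OPS_PER_CORE_8BIT = 2048

def ops_per_cycle(precision_bits: int) -> int:
    """Total chip-wide operations per clock cycle at a given precision."""
    scale = {8: 1, 4: 2, 2: 4}[precision_bits]
    return CORES * OPS_PER_CORE_8BIT * scale

for bits in (8, 4, 2):
    print(f"{bits}-bit: {ops_per_cycle(bits):,} ops/cycle")
# 8-bit:  524,288 ops/cycle
# 4-bit: 1,048,576 ops/cycle
# 2-bit: 2,097,152 ops/cycle
```

Multiplying these per-cycle figures by the clock frequency (not stated here) would give the chip's peak operations per second at each precision level.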

This new technology is constrained by its on-chip memory, however, which caps the size of the neural networks it can host. It was not designed to power the latest large language models, for example. NorthPole is much better suited for edge computing applications, where a blend of performance and efficiency is required. As an added benefit, the chip's efficiency means that elaborate cooling systems are not needed, which further enhances its suitability for edge and portable use cases.

Some practical experiments were carried out to test the real-world performance of this new chip architecture. When running the ResNet50 model for image classification, NorthPole exhibited 22 times lower latency than a GPU manufactured with a comparable 12-nanometer process, and did so with 25 times higher energy efficiency.

Looking ahead, the team plans to explore fabricating NorthPole chips with a more modern process. Beyond that, they are investigating the types of applications that benefit most from their novel architecture.

nickbild

R&D, creativity, and building the next big thing you never knew you wanted are my specialties.
