In recent years, specialized processor hardware has seen significant development, accompanied by ever faster innovation cycles. A decade ago, all a programmer had at their disposal was a CPU with an x86 architecture, possibly a GPU, and, as an edge case, an FPGA. Fast forward to today, and application-specific hardware is being developed to efficiently meet application-specific performance needs. This diversity in hardware, while beneficial, has resulted in a fragmented software ecosystem in which every type of hardware requires its own unique code to program it. Another concern is that these programming languages are often rather low level, requiring a deep understanding of the hardware topology in order to create efficient implementations. Last but not least, these languages take time to develop and to learn; for example, the first neural processing unit (NPU) accelerators were announced in 2017, while the first use of these accelerators for non-NN applications was only released in 2022. These factors combined severely limit both the effective use of new types of hardware and the range of domains in which this new hardware can be used to accelerate performance. For this reason, we created TINA, a framework that allows users to accelerate their code across various forms of hardware, with the only requirement being that the hardware supports Neural Networks (NNs). We ensure the portability of this acceleration by relying on a ubiquitous building block: the convolution. In practice, this means that we “map” functions ranging from linear algebra to control flow onto a series of different types of convolutions that we refer to as “TINA layers”.
But why use convolutions? Convolutions are ubiquitous in the current NN acceleration software stack. A quick glance at GitHub and Google Scholar will show you tens of thousands of implementations of convolution accelerators across a large variety of hardware. Within TINA, we aim to stand on the shoulders of these giants in order to create efficient accelerated implementations of dataflow algorithms. In our paper "TINA: Acceleration of Non-NN Signal Processing Algorithms Using NN Accelerators" [9], we further detail how TINA leverages the aforementioned convolution accelerators and how functions can be mathematically mapped onto convolutions, together with results of running these functions through CUDA on general-purpose GPU cores as well as Tensor Cores, and comparing them to the current state of the art such as Google JAX.
TINA can support many different non-NN application domains such as data processing in astronomy and medical fields. Examples include filtering, beamforming and imaging in ultrasound and radio telescope systems. Through TINA these systems can leverage recent developments in HPC such as Tensor units or NN-specific hardware.
The recent line of NPUs released by AMD is a perfect fit for us to accelerate applications. The AMD Ryzen 7940HS, with its AMD XDNA NPU architecture, gives us enough resources to accelerate more computationally intensive applications. For this Hackster competition, we demonstrate accelerating several linear algebra and signal processing functions, combined with a more complex algorithm in the form of a Polyphase Filter Bank (PFB).
Polyphase Filter Bank
A PFB channelizes time domain digitized input signals into frequency channels. It is a powerful filter operation that allows for the efficient division of signals into multiple frequency bands, enabling filtering of specific frequency components and simultaneous processing of different frequency components. The PFB is found in many different application domains such as radio astronomy [7], [6], wireless communication [2], radar [5], ultrasound imaging [4] and quantum computing [3].
A PFB is an implementation of a combination of an FFT and an FIR band-pass filter. The FFT converts the time domain digitized input signal into frequency channels. However, the FFT introduces spectral leakage in the resulting frequency channels. This spectral leakage can be reduced by applying an FIR band-pass filter to each output channel of the FFT.
A more elaborate explanation of polyphase filters is found in [6]. We used the reference implementation provided in this work (https://github.com/telegraphic/pfb_introduction) as a starting point for our implementation with TINA.
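To make this structure concrete, the following is a minimal, non-overlapping NumPy sketch of a PFB: an FIR part that weights and sums blocks of samples with a prototype filter, followed by an FFT. The function name, the window choice (sinc times Hamming) and the block handling are simplifications for illustration rather than the reference implementation linked above.

import numpy as np

def pfb_channelize(x, n_taps, n_chan):
    # Prototype low-pass FIR filter: windowed sinc, one coefficient per (tap, channel) sample
    coeffs = np.sinc(np.arange(n_taps * n_chan) / n_chan - n_taps / 2.0) * np.hamming(n_taps * n_chan)
    # Split the signal into blocks of n_taps * n_chan samples (simplified, non-overlapping)
    n_blocks = len(x) // (n_taps * n_chan)
    blocks = x[:n_blocks * n_taps * n_chan].reshape(n_blocks, n_taps, n_chan)
    # FIR part: weight every tap and sum across taps for each channel position
    filtered = (blocks * coeffs.reshape(n_taps, n_chan)).sum(axis=1)
    # FFT part: channelize each filtered block into frequency channels
    return np.fft.rfft(filtered, axis=1)

# Example: channelize a noisy tone into 64 frequency channels using 4 taps
signal = np.sin(2 * np.pi * 0.21 * np.arange(2**16)) + 0.1 * np.random.randn(2**16)
spectra = pfb_channelize(signal, n_taps=4, n_chan=64)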
The AMD AI Engines (AIEs) were first introduced in the Versal Adaptive System on Chip (SoC) FPGAs. In another project, the authors worked on the evaluation of a PFB on the AIEs in this device (https://git.astron.nl/RD/acap, [8]). We found that we could efficiently develop a performant dataflow implementation on this architecture, which made us curious about the AMD Ryzen NPU.
When the Ryzen NPU was first introduced, it was only programmable through PyTorch and TensorFlow. This led us to explore programming the Ryzen NPU with the TINA framework. Very recently, the Peano framework was introduced, providing LLVM support for the Ryzen AI architecture (https://www.phoronix.com/news/AMD-Peano-LLVM-Ryzen-AI). Although Peano allows for more general use of the NPU, our TINA framework still provides the benefits of portability across platforms as well as high-level programmability.
The Mathematics behind TINA
The main idea behind TINA is to represent operations and algorithms as a series of convolutions using so-called TINA layers. In this section, we go into detail for one of the easier operations to map: elementwise matrix multiplication. A more in-depth look at the mathematics of 8 more implemented operations can be found in our publication: "TINA: Acceleration of Non-NN Signal Processing Algorithms Using NN Accelerators" [9].
The building block used for elementwise matrix multiplication: Depthwise convolutions
Convolutions are critical in deep learning for feature extraction from various types of input data, such as images and audio. A standard convolution process involves sliding a kernel across the input data, conducting elementwise multiplication with the input values covered by the kernel, and summing these products to generate output feature maps. Standard convolutions can be mathematically represented in the following equation.
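In LaTeX notation, and following the convention used by PyTorch's Conv2d, the standard convolution can be sketched as:

Out(n, c_out, h, w) = b(c_out) + \sum_{c_in} \sum_{i} \sum_{j} Kernel(c_out, c_in, i, j) \cdot Input(n, c_in, h + i, w + j)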
Where c_in and c_out are indices representing the input and output channels, respectively, while b is the bias. PyTorch enables the customisation of convolutional layers. This customisation includes controlling the dimensions of both the input matrix, represented as (N, C_in, H, W), and the output matrix, represented as (N, C_out, H, W), where N is the batch size, C denotes the number of channels, and H and W are the height and width of the input channels in pixels, respectively. The standard convolution operation is visually represented in the figure below.
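As a small illustration of these dimension conventions (the channel counts and spatial sizes below are arbitrary examples):

import torch
import torch.nn as nn

# Standard convolution in PyTorch: input (N, C_in, H, W) -> output (N, C_out, H, W)
conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3, padding=1)
x = torch.randn(1, 3, 32, 32)  # N=1, C_in=3, H=W=32
y = conv(x)                    # y.shape == (1, 8, 32, 32)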
Distinct from standard convolutional layers, which convolve across both input channels and spatial dimensions, depthwise convolution applies the corresponding channel of the kernel to each input channel independently, resulting in an output with the same number of channels as the input. This operation is represented in the following equation:
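In sketch form, using the same index names as above (note that the kernel no longer has a c_in dimension):

Out(n, c, h, w) = b(c) + \sum_{i} \sum_{j} Kernel(c, i, j) \cdot Input(n, c, h + i, w + j)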
In the figure below a depthwise convolution is visualized.
From Depthwise convolution to Elementwise matrix multiplication
In order to have this equation represent an element-wise matrix multiplication, we first set the bias b(c_out) = 0. In addition, we convert the width and height of both the input matrix and the kernel into 1 x 1 matrices and reshape the elements of these two matrices into vectors along the channel axis where C_out = H * W, resulting in the following equation:
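In sketch form, with b(c_out) = 0 and every channel reduced to a single 1 x 1 element:

Out(n, c, 1, 1) = Kernel(c) \cdot Input(n, c, 1, 1), for c = 1, ..., H * W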
As the equation shows, we now have implemented an elementwise matrix multiplication where the first matrix is equal to Input and the second matrix to the Kernel. A visualization of this operation based on the depthwise convolution can be seen below.
We have used the AMD Ryzen AI SW stack v1.2 as our basis; the steps needed for installing this software stack on a Minisforum UM790 Pro are detailed by goichi harade. Next, navigate to the tutorial directory within your installed RyzenAI-SW repository and install TINA using the following git command:
git clone https://github.com/ChristiaanBoe/TINA.git
From that point onwards we can use the NPU Jupyter notebooks within the directory NPU scripts. Before running the notebooks, the following libraries should be installed in your RyzenAI Anaconda environment using the following command:
pip install numpy matplotlib
Explanation of the code
In this example, we start off with one of the easiest functions to map onto a convolution, namely an elementwise matrix multiplication. We based our code on the Hello World tutorial found in the RyzenAI-SW repository. First, we define the TINA layer for the elementwise matrix multiplication: as explained in the previous paragraphs, both matrices are reshaped into vectors along the channel axis, with the first matrix serving as the input and the second matrix being mapped onto the kernel.
import torch
import torch.nn as nn

class ElementwiseMult(nn.Module):
    def __init__(self, matrix) -> None:
        super(ElementwiseMult, self).__init__()
        self.batch, self.channels, self.height, self.width = matrix.shape
        # Depthwise 1x1 convolution: one group per flattened matrix element, no bias
        self.conv_layer = nn.Conv2d(
            self.height * self.width,
            self.height * self.width,
            bias=False,
            kernel_size=(1, 1),
            stride=(1, 1),
            groups=self.height * self.width,
        )
        # The second operand of the elementwise product becomes the convolution kernel
        self.conv_layer.weight.data = matrix.view(self.height * self.width, 1, 1, 1)

    def forward(self, x):
        batch_size = x.shape[0]
        # Flatten the spatial dimensions into the channel axis
        input = x.view(batch_size, self.height * self.width, 1, 1)
        out = self.conv_layer(input)
        return out.view(x.shape)
In order to test whether our code is defined correctly, we initialize our elementwise multiplication using the following code.
import numpy as np

size = 5
mat_one = torch.from_numpy(np.random.rand(1, 1, size, size)).float()
mat_two = torch.from_numpy(np.random.rand(1, 1, size, size)).float()
MAT_layer = ElementwiseMult(mat_one)
MAT_layer = MAT_layer.float()
In the code above, we initialised two random matrices and created an instance of ElementwiseMult. Now let's check if it runs correctly.
outputnormal = mat_one * mat_two
outputTINA = MAT_layer(mat_two)
print(outputnormal)
print(outputTINA)
Perfect! Our outputs are the same. Now, the first step towards running our elementwise matrix multiplication on the NPU is to export it to the ONNX format:
tmp_model_path = "models/ElementMat.onnx"
torch.onnx.export(
    MAT_layer,                # model being run
    mat_two,                  # model input (or a tuple for multiple inputs)
    tmp_model_path,           # where to save the model
    export_params=True,       # store the trained parameter weights inside the model file
    opset_version=13,         # the ONNX version to export the model to
    input_names=['input'],    # the model's input names
    output_names=['output'],  # the model's output names
    dynamic_axes={'input': {0: 'batch_size'}, 'output': {0: 'batch_size'}}  # variable-length axes
)
Inspecting the exported graph shows that the reshape operations produce a large number of nodes; it would be a good experiment to see whether an ONNX optimiser could not only simplify the graph but also decrease the runtime, as sketched below.
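One way to try this would be the onnx-simplifier package; the following is a minimal sketch of that experiment (we have not benchmarked this step), assuming onnxsim is installed via pip:

import onnx
from onnxsim import simplify

# Load the exported model and try to fold/merge the many reshape nodes
model = onnx.load("models/ElementMat.onnx")
model_simplified, check_ok = simplify(model)
assert check_ok, "simplified model failed the correctness check"
onnx.save(model_simplified, "models/ElementMat_simplified.onnx")

With or without this optional optimisation, the final step is to quantize our model to INT8 using the Vitis AI Quantizer.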
# Quantize the model to INT8 with the Vitis AI Quantizer for ONNX
import vai_q_onnx

# `input_model_path` is the path to the original, unquantized ONNX model.
input_model_path = "models/ElementMat.onnx"
# `output_model_path` is the path where the quantized model will be saved.
output_model_path = "models/ElementMat_quantized.onnx"

vai_q_onnx.quantize_static(
    input_model_path,
    output_model_path,
    calibration_data_reader=None,
    quant_format=vai_q_onnx.QuantFormat.QDQ,
    calibrate_method=vai_q_onnx.PowerOfTwoMethod.MinMSE,
    activation_type=vai_q_onnx.QuantType.QUInt8,
    weight_type=vai_q_onnx.QuantType.QInt8,
    enable_ipu_cnn=True,
    extra_options={'ActivationSymmetric': True}
)
Finally, we run the quantized model on the NPU:
# Run the quantized model through ONNX Runtime with the Vitis AI Execution Provider
import onnxruntime
from timeit import default_timer as timer

# Point to the config file path used for the VitisAI Execution Provider
config_file_path = "vaip_config.json"

# Test input as a NumPy array (here we reuse mat_two from earlier)
input_data = mat_two.numpy()

aie_options = onnxruntime.SessionOptions()
#aie_options.enable_profiling = True

aie_session = onnxruntime.InferenceSession(
    "models/ElementMat_quantized.onnx",
    providers=['VitisAIExecutionProvider'],
    sess_options=aie_options,
    provider_options=[{'config_file': config_file_path}]
)

# Warm-up run, then a timed run
ryzen_outputs = aie_session.run(None, {'input': input_data})
start = timer()
ryzen_outputs = aie_session.run(None, {'input': input_data})
aie_total = timer() - start
#aie_session.end_profiling()
And there you have it! An elementwise matrix multiplication running on the NPU. We highly recommend checking out the other NPU notebooks found in our Git repository at https://github.com/ChristiaanBoe/TINA/tree/main/NPU%20scripts and comparing their performance by running the code on the GPU. For ease of portability, all TINA layers are defined independently in our repository under https://github.com/ChristiaanBoe/TINA/tree/main/TINA%20layers.
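As a starting point for such a comparison, the snippet below is a minimal sketch of timing the same TINA layer on a GPU with PyTorch; it assumes a CUDA-capable device and reuses the MAT_layer, mat_two and timer defined above.

# Optional GPU comparison of the same TINA layer (requires a CUDA-capable GPU)
if torch.cuda.is_available():
    gpu_layer = MAT_layer.cuda()
    gpu_input = mat_two.cuda()
    torch.cuda.synchronize()
    start = timer()
    gpu_out = gpu_layer(gpu_input)
    torch.cuda.synchronize()
    gpu_total = timer() - start
    print(f"GPU runtime: {gpu_total:.6f} s")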
Results
In order to evaluate the capabilities of our TINA framework on the Ryzen NPU, we measured the runtime of a complex algorithm (the PFB algorithm with an attached Discrete Fourier Transform (DFT)) as well as a couple of signal-processing operations (the FIR algorithm and the unfolding algorithm). We compared the NPU runtime to that of the CPU executed using NumPy. In addition, we also measured the runtime of simpler linear algebra operations, an elementwise matrix multiplication and a matrix-matrix multiplication, on the NPU using TINA code and compared it to the CPU NumPy equivalent. Below we show the figures with the results of these measurements.
The PFB figure shows a speedup of circa 9.5x compared to the Ryzen CPU NumPy implementation for our most computationally complex operation. This is likely attributable to the efficient handling of data-independent loops by TINA in comparison to NumPy. For the individual signal processing and linear algebra operations, we see that we can offload the work to the NPU and, in general, obtain a speedup across all measured operations (compared to NumPy in the graphs) except for the elementwise matrix multiplication.
Secondly, we show that, with minimal effort compared to our original code, we could execute non-NN code on AMD's Ryzen NPU. This was impossible before the release of Peano about a month ago. As of the time of writing, using Peano still requires users to rewrite their non-NN algorithms in Peano's low-level language equivalent. In contrast, TINA allows running a non-NN algorithm on the NPU with just two extra operations, or approximately 20 lines of added code.
Third, a major strength we have identified in the Ryzen NPU (in comparison to other hardware platforms we used) is that it allows us to process operations with far larger matrices. In previous benchmarks we ran on GPUs [9], CUDA memory was a limiting factor even though we used a server-class GPU (Tesla T4) with 16 GB of memory. This limitation is alleviated on the AMD Ryzen NPU since it shares memory with the CPU, providing up to 64 GB and greatly increasing its capacity to process larger applications. For future work, we intend to identify optimization opportunities in our framework, which will require much more detailed insight into the hardware and run-time profiling of the application. This will further increase the performance of TINA on the Ryzen NPU and provide new insights from running our applications on the newly released Ryzen AI SW version 1.2.
Citations
[1] Harris, Fredric J., Multirate Signal Processing for Communication Systems, 2021, [Book], chapter 9 Polyphase Channelizers, pages 241-273
[2] Harris, F.J. and Dick, C. and Rice, M., Digital receivers and transmitters using polyphase filter banks for wireless communications, 2003, 10.1109/TMTT.2003.809176
[3] Pfau, Johannes and Figuli, Shalina Percy Delicia and Bähr, Steffen and Becker, Jürgen, Reconfigurable FPGA-Based Channelization Using Polyphase Filter Banks for Quantum Computing Systems, 2018, 10.1007/978-3-319-78890-6_49
[4] Kim, Taewan and Lee, Choong and Kim, Jung-jun and Song, Tai-Kyong, A New Dynamic Decimation Filter Using Polyphase MACs for Medical Ultrasound Imaging, 2008, 10.1109/ULTSYM.2008.0338
[5] Zhou, Daniel, A Review of Polyphase Filter Banks and Their Application, 2006
[6] Price, Danny C, Spectrometers and Polyphase Filterbanks in Radio Astronomy, 2021, [Book], Chapter 1, pages 159-179
[7] van Haarlem, M. P. et al., LOFAR: The LOw-Frequency ARray, 2013, 10.1051/0004-6361/201220873
[8] van Wijhe, Victor and Sprave, Vincent and Passaretti, Daniele and Alachiotis, Nikolaos and Grutzeck, Gerrit and Pionteck, Thilo and van der Vlugt, Steven, Exploring the Versal AI Engines for Signal Processing in Radio Astronomy, 2024, to appear at FPL in September
[9] Christiaan Boerkamp, Steven van der Vlugt, Zaid Al-Ars, "TINA: Acceleration of Non-NN Signal Processing Algorithms Using NN Accelerators", Int’l Workshop on Machine Learning for Signal Processing, 2024, UK.