Deep learning is ubiquitous. This project sought to accelerate deep learning inference on FPGA hardware.
In a previous blog post, AMD-Xilinx's Vitis AI deep learning framework was evaluated. It showed that Vitis AI can be used to easily implement large neural networks on FPGAs, with performance above that of other embedded devices. But especially for small networks, the framework overhead comes into play and the performance is quite poor.
In this project, a small, fully connected neural network is accelerated using Vitis High-Level Synthesis (HLS) on FPGA hardware.
HLS is used to speed up the design process compared to traditional RTL design techniques using a hardware description language such as VHDL or Verilog.
The Kria KV260 SoM in combination with PYNQ is used for this project.
Fundamentals

An ANN consists of many neurons that are interconnected in layers. Figure 1 shows the structure of such a network.
The output a_i^{(l)} of neuron i in layer l is illustrated in Figure 2.
The output can be numerically computed with

a_i^{(l)} = f\left( \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} \, a_j^{(l-1)} + b_i^{(l)} \right)

where n_{l-1} is the number of nodes of the previous layer. This can be written in matrix notation, with

\mathbf{a}^{(l)} = f\left( W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)

where the weights are arranged as follows:

W^{(l)} = \begin{pmatrix} w_{11}^{(l)} & \cdots & w_{1 n_{l-1}}^{(l)} \\ \vdots & \ddots & \vdots \\ w_{n_l 1}^{(l)} & \cdots & w_{n_l n_{l-1}}^{(l)} \end{pmatrix}

The activation function f is applied to each neuron separately, i.e. elementwise.
Quantization converts a model, which is typically trained with 32-bit floating-point accuracy, to a lower precision such as 8-bit integer or 16-bit float. For general computing architectures, the time it takes to compute, for example, an 8-bit integer multiplication is significantly lower than that of a 32-bit floating-point multiplication.
As can be seen in Figure 3, all the inputs are either 8-bit or 32-bit integers and the output is an 8-bit integer. After the multiply-accumulate operation in the dense layer, the output is quantized from 32 bit back to 8 bit. This is done with a scaling factor.
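As a rough sketch of this data path (the array sizes and the scaling value below are placeholders, not taken from the actual model), the computation of a single output neuron looks like this:

#include <cstdint>

// Sketch of the quantized data path of one output neuron: 8-bit inputs and
// weights, 32-bit accumulation, then requantization of the result to 8 bit.
int8_t quantized_neuron(const int8_t x[784], const int8_t w[784], int32_t bias) {
    int32_t acc = bias;                        // 32-bit accumulator
    for (int k = 0; k < 784; ++k) {
        acc += int32_t(x[k]) * int32_t(w[k]);  // int8 x int8 products, summed in int32
    }
    return int8_t(acc / 256);                  // placeholder scaling factor back to int8
}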
The network used:
Layer (type) Output Shape Param #
================================================================
dense_0 (Dense, ReLU) (None, 128) 100480
dense_1 (Dense, ReLU) (None, 256) 33024
dense_2 (Dense, ReLU) (None, 10) 2570
================================================================
Total params: 136,074
Trainable params: 136,074
Non-trainable params: 0
________________________________________________________________
The Python code to compute the inference of this network boils down to a vector-matrix multiplication followed by the activation function for each layer.
The deep learning framework TensorFlow is used for training the neural network. AMD-Xilinx Vitis AI is used for the FPGA-optimized quantization of the trained model.
The difference between FPGA-optimized quantization and typical integer quantization for microcontrollers is the scaling factor before the output. In the paper on the quantization technique of the TensorFlow Lite framework, the scaling factor is described as a fixed-point multiplication. In FPGAs, however, this factor takes the form of a division by a power of two (2^n), which can be implemented in hardware as a very cheap bit-shift operation.
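A minimal sketch of this requantization step, assuming a hypothetical shift value:

#include <cstdint>

// Requantization after a dense layer: division by 2^SHIFT as an arithmetic
// right shift, followed by saturation to the int8 range.
constexpr int SHIFT = 8;   // placeholder power-of-two scaling exponent

int8_t requantize(int32_t acc) {
    int32_t scaled = acc >> SHIFT;
    if (scaled > 127)  scaled = 127;
    if (scaled < -128) scaled = -128;
    return int8_t(scaled);
}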
The dump_model() function in Vitis AI is used to output the quantized weights and biases. It also outputs the network's output for a given input. With this information, the scaling factors can be calculated. Internally, there is more than one scaling factor in the inference implementation of Vitis AI (after Dense, BiasAdd and ReLU). Unfortunately, Vitis AI does not provide enough data to calculate them all. However, for this example, a single overall factor after the activation function is sufficient and no significant loss of accuracy is observed.
To write the weights and biases into C++ constants, a Python script was written.
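A generated header might look roughly like the following; the names, types and values here are hypothetical and not copied from the project's files (the actual constants are part of the sources added to the Vitis HLS project, e.g. src/mnist.h):

#include <cstdint>

// Hypothetical excerpt of a generated weight header (illustrative only).
static const int8_t  w0[784][128] = { /* quantized weights of dense_0 */ };
static const int32_t b0[128]      = { /* quantized biases of dense_0 */ };
static const int     shift0       = 8;   // hypothetical power-of-two scaling exponent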
High Level Synthesis

As seen in the fundamentals part of this post, the inference of a fully connected neural network is a vector-matrix multiplication with the ReLU activation function and the scaling factor.
The basics for the implementation of the matrix multiplication are from a great HLS tutorial (YouTube, GitHub) from the Scalable Parallel Computing Laboratory at ETH Zurich.
General Matrix Multiplication
First, the general matrix multiplication is implemented directly with three loops. With this structure, the synthesis tool will not be able to pipeline the innermost loop, because every iteration of the innermost loop reads and writes the acc variable. On hardware, this cannot be done within one clock cycle.
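A sketch of such a naive implementation, in the spirit of the tutorial code (the float type and the dimensions are illustrative, not the ones used in the project):

constexpr int N = 32, K = 32, M = 32;  // illustrative matrix dimensions

// Naive matrix multiplication C = A * B with three nested loops.
void matmul_naive(const float A[N][K], const float B[K][M], float C[N][M]) {
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                #pragma HLS PIPELINE II=1
                // Loop-carried dependence: acc is read and written in every
                // iteration of the innermost loop.
                acc += A[n][k] * B[k][m];
            }
            C[n][m] = acc;
        }
    }
}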
This loop-carried dependence can be resolved with a simple reordering of the iteration space. The accumulation is kept in a buffer arr, and the loops are reordered so that arr[m] is read and only written back again a full inner-loop pass later. Consecutive pipeline iterations therefore touch different elements, and no loop-carried dependence is generated.
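A sketch of the reordered loop nest, using the same illustrative types and dimensions as above:

// Reordered matrix multiplication: the k loop is moved outside the m loop, so
// consecutive iterations of the pipelined inner loop touch different elements
// of the accumulation buffer arr.
void matmul_reordered(const float A[N][K], const float B[K][M], float C[N][M]) {
    float arr[M];
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            for (int m = 0; m < M; ++m) {
                #pragma HLS PIPELINE II=1
                float prev = (k == 0) ? 0.0f : arr[m];  // read arr[m] ...
                arr[m] = prev + A[n][k] * B[k][m];      // ... and write it back
            }
        }
        for (int m = 0; m < M; ++m) {
            #pragma HLS PIPELINE II=1
            C[n][m] = arr[m];  // write the finished row to the output
        }
    }
}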
Now we want some parallelism; there are two dimensions we can unroll. First, we can buffer multiple (D) rows of the A matrix and process them in parallel, which cuts the number of iterations of the outer loop by a factor of D. Second, we can process the data as vectors of width W, which cuts the number of iterations of the inner loop by a factor of W. In order to vectorize the design, a vector type Vec_t must be introduced.
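The following sketch only illustrates the effect of such vectorization; the project itself uses the Vec_t type from src/Datapack.h, while here a plain array and an UNROLL pragma stand in for it:

constexpr int W = 8;  // illustrative vector width

// One update step over a group of W values: with the loop unrolled, W
// multiply-accumulates happen in the same clock cycle, which is what packing
// W elements into the vector type Vec_t achieves. (For this to work, the
// arrays must be partitioned into registers where they are defined.)
void accumulate_vector(float a_nk, const float b[W], float arr[W], bool first) {
    for (int w = 0; w < W; ++w) {
        #pragma HLS UNROLL
        float prev = first ? 0.0f : arr[w];
        arr[w] = prev + a_nk * b[w];
    }
}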
To obtain a general dense layer function, several modifications are required. The goal is a fully parallel and pipelined HLS function.
Variable dimensions are generally problematic for a hardware implementation: one has to imagine how a synthesis tool should generate hardware for a loop with a variable number of iterations, or how to instantiate registers for an array of variable length. But for a dense layer, which consists only of vector-matrix multiplications, it is possible. A fully unrolled vector-matrix multiplication means in this case:
- N = 1, and therefore D = 1
- K = number of nodes from the previous layer (or input)
- M = number of nodes of the current layer
- W = M if fully parallel
With the pragma HLS LOOP_TRIPCOUNT one can tell the tool the expected number of iterations of loops with variable bounds, which helps it analyze and report on such otherwise unconstrained loops. In this vector-matrix multiplication, most of them are executed only once. Other modifications (see the sketch after this list):
- The scale factor is implemented as a bit-shift operation
- The ReLU activation function is implemented as a ternary operation
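A hedged sketch of what such a dense-layer function can look like; the data types, pragmas and the scaling exponent are illustrative and not copied from src/MLP.cpp:

#include <cstdint>

// K: nodes of the previous layer, M: nodes of the current layer,
// SHIFT: power-of-two scaling exponent of the layer.
template <int K, int M, int SHIFT>
void dense_relu(const int16_t in[K], const int16_t weights[K][M],
                const int32_t bias[M], int16_t out[M]) {
    int32_t acc[M];
    #pragma HLS ARRAY_PARTITION variable=acc complete
    // (the weights also need to be partitioned along M where they are defined)

Accumulate: // one input element per clock cycle, all M MACs in parallel
    for (int k = 0; k < K; ++k) {
        #pragma HLS PIPELINE II=1
        for (int m = 0; m < M; ++m) {
            #pragma HLS UNROLL
            int32_t prev = (k == 0) ? bias[m] : acc[m];
            acc[m] = prev + in[k] * weights[k][m];
        }
    }

Scale_ReLU: // scale factor as a bit shift, ReLU as a ternary operation
    for (int m = 0; m < M; ++m) {
        #pragma HLS UNROLL
        int32_t scaled = acc[m] >> SHIFT;                      // saturation omitted
        out[m] = (scaled > 0) ? int16_t(scaled) : int16_t(0);
    }
}

Since the dimensions here are compile-time constants, no LOOP_TRIPCOUNT pragma is needed; in the actual code it helps the tool report latency for loops whose bounds it cannot infer.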
The top function of the network consists of the input/output specification and is basically a function call for every layer in the network. The macro __SYNTHESIS__ can be used to isolate code that will not be synthesized. To eliminate the overhead of copying every image separately to the FPGA, all 10,000 test images are loaded into the programmable logic.
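A rough sketch of how such a top function can be structured, reusing the dense_relu template sketched above; the interfaces, weight constants and scaling exponents are placeholders, and the real implementation is the MultilayerPerceptron function in the project sources (presumably src/MLP.cpp):

// Placeholder weight and bias constants as they might be emitted by the
// Python script (values omitted; the real data is in the generated headers).
static const int16_t w0[784][128] = {}, w1[128][256] = {}, w2[256][10] = {};
static const int32_t b0[128] = {}, b1[256] = {}, b2[10] = {};

void MultilayerPerceptron(const int16_t images[10000][784], int16_t results[10000][10]) {
    #pragma HLS INTERFACE m_axi     port=images  bundle=gmem0
    #pragma HLS INTERFACE m_axi     port=results bundle=gmem1
    #pragma HLS INTERFACE s_axilite port=return

    // All 10000 test images are processed in one kernel invocation to avoid
    // the overhead of copying each image to the FPGA separately.
    for (int i = 0; i < 10000; ++i) {
        int16_t a0[128], a1[256];
        dense_relu<784, 128, 8>(images[i], w0, b0, a0);   // dense_0 + ReLU
        dense_relu<128, 256, 8>(a0, w1, b1, a1);          // dense_1 + ReLU
        dense_relu<256, 10, 8>(a1, w2, b2, results[i]);   // dense_2
    }
}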
Because Vec_t has a fixed length of 256, the weight matrices of layer 1 (784x128) and layer 3 (256x10) must be padded to a width of 256.
Bandwidth

With this configuration, Vitis HLS reaches its limit with the internal bandwidth. The original Vitis AI quantization keeps 32-bit integers after the multiply-accumulate operation in the dense layer. With a vector width of 256 and 32-bit integers, a total vector width of 8192 bits results, which is quite a lot. Therefore, the quantization factors were chosen so that a 16-bit integer precision can be used.
Board Setup

This part follows the GitHub repository Xilinx/Kria-PYNQ.
- Download the Ubuntu Image (We used Ubuntu Desktop 20.04.3 LTS)
- Flash Ubuntu Image to a SD-Card with balenaEtcher
- Insert the microSD card containing the boot image in the microSD card slot (J11) on the Starter Kit.
- Connect the micro USB end to J4 on the Starter Kit.
- Connect the Ethernet cable for required internet access.
- Grab the Power Supply and connect it to the DC Jack (J12) on the Starter Kit.
- Boot the FPGA
- Connect to FPGA with a serial terminal program of your choice (First of the two Serial Ports)
- User name: ubuntu, password: ubuntu (user must change password after first boot)
- Baud rate = 115200
- Data bits = 8
- Stop bits = 1
- Flow control = None
- Parity = None
Run the following commands to install the environment:
sudo snap install xlnx-config --classic --channel=1.x
xlnx-config.sysinit
git clone https://github.com/Xilinx/Kria-PYNQ.git
cd Kria-PYNQ/
sudo bash install.sh
This can take a while, around 30 minutes. Afterwards, one can connect to JupyterLab via a web browser at kria:9090/lab; the password is xilinx.
In the Jupyter environment, open a terminal and run the following commands:
cd home/root/jupyter_notebooks
git clone https://github.com/Nunigan/MNIST_HLS.git
cd MNIST_HLS
The folder MNIST_HLS contains all the pre-built data you need.
Vitis HLS

Clone the GitHub repository with all the code to your host PC where Vitis HLS and Vivado are installed. Prebuilt data is also included in the repo.
- Create a Vitis HLS project
- Add src/MLP.cpp, src/MLP.h, src/mnist.h and src/Datapack.h to the source files
- Add src/MLP_tb.cpp to the testbench files
- Select the xck26-sfvc784-2LV-c part
- Use the Vivado IP flow target
- Under Project Settings > Synthesis, choose MultilayerPerceptron as the top function
- Run the C simulation and check the console for errors
If the proposed network can be fully parallelized, each layer consumes one input element per clock cycle, so the theoretical number of cycles to compute one inference would be 784 + 128 + 256 = 1168.
When synthesis is run, a report similar to the following should be generated. One can see that the generated RTL needs 1287 clock cycles to compute the inference of one image. The difference from the theoretical minimum is due to the overhead of the function calls and data transfers.
The RTL can then be exported to a designated location.
Create a new Vivado project and select the Kria KV260 board. Under Settings/IP/Repository, add the path to the IP block exported and unpacked from Vitis HLS. Create a block design with a Zynq UltraScale+ MPSoC and the HLS block. Run block automation.
Set the clock frequency for the HLS block to 250 MHz.
Select an AXI high-speed slave interface with a data width of 64 bits.
Run block connection automation to connect everything.
After that, run the bitstream generation. Export the bitstream, the block design (block design must be open for that option) and copy the hardware handoff from this location:
*.gen/sources_1/bd/design_1/hw_handoff/design_1.hwh
These three files are needed to run with PYNQ.
Pynq

- Connect to kria:9090/lab over a browser
- Create a folder and add the pynq/ folder from the GitHub repository
The notebook mnist.ipynb shows how easily an HLS block can be tested:

- Load the bit file
- Overclock the system as far as the results remain correct
- Print the register map; the name MultilayerPerceptron_0 can be seen under the block properties in Vivado
- Allocate buffer space for the input and output AXI buses
- Load all the necessary data (weights and biases are only needed for the software implementation)
- Define the function for the hardware-accelerated MNIST inference
- Define the function for the software implementation
- Check whether both functions generate the same output
- Compute the performance gain and the frames per second (FPS)
The notebook dpu_mnist.ipynb runs the same fully connected neural network (mnist.xmodel) on the DPU. The performance is about 4000 FPS.
A quick comparison with my VHDL implementation of the same neural network shows that HLS has a lower hardware utilization. In one respect, however, the HDL implementation is better: HLS does not reuse the DSPs across the different layers. It instantiates separate DSPs for each layer (128 + 256 + 10 = 394), while my VHDL implementation reuses the same DSPs for every layer.
This project shows how simple neural networks can be accelerated with Vitis HLS. Only the implementation for a small network is shown; for larger networks, adjustments to the code need to be made.
If one wants to recreate the quantization and file generation, the Python script src/utils.py provides everything. Vitis AI must be installed on the host PC.
The PYNQ code is inspired by zst123's blog post about a SHA256 Crypto Accelerator with PYNQ & Vitis HLS.