Deep learning is ubiquitous. This project sought to accelerate deep learning inference on FPGA hardware.
In a previous blog post, AMD-Xilinx's Vitis AI deep learning framework was evaluated. It showed that Vitis AI can be used to easily implement large neural networks on FPGAs, with performance above that of other embedded devices. But especially for small networks, the framework overhead comes into play and the performance is quite poor.
In this project, a small, fully connected neural network is accelerated using Vitis High-Level Synthesis (HLS) on FPGA hardware.
HLS is used to speed up the design process compared to traditional RTL design techniques using a hardware description language such as VHDL or Verilog.
The Kria KV260 SoM in combination with PYNQ is used for this project.
Fundamentals

An ANN consists of many neurons that are interconnected in layers. Figure 1 shows the structure of such a network.
The output a_i^{(l)} of neuron i in layer l is illustrated in Figure 2.
The output can be numerically computed with

a_i^{(l)} = f\left( \sum_{j=1}^{n_{l-1}} w_{ij}^{(l)} \, a_j^{(l-1)} + b_i^{(l)} \right)

where n_{l-1} is the number of nodes of the previous layer. This can be written in matrix notation, with

\mathbf{a}^{(l)} = f\left( W^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)} \right)

where the weights are arranged as follows:

W^{(l)} = \begin{pmatrix} w_{11}^{(l)} & \cdots & w_{1 n_{l-1}}^{(l)} \\ \vdots & \ddots & \vdots \\ w_{n_l 1}^{(l)} & \cdots & w_{n_l n_{l-1}}^{(l)} \end{pmatrix}

The activation function f is applied to each neuron separately, i.e. elementwise.
Quantization converts a model, which is typically trained with 32-bit floating-point accuracy, to a lower precision such as 8-bit integer or 16-bit float. For general computing architectures, the time it takes to compute, for example, an 8-bit integer multiplication is significantly lower than that of a 32-bit floating-point multiplication.
As can be seen in Figure 3, all the inputs are either 8-bit or 32-bit integers and the output is an 8-bit integer. After the multiply-accumulate operation in the dense layer, the output is quantized from 32 bit back to 8 bit. This is done with a scaling factor.
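As a rough sketch of this data path (the array sizes and the scaling value below are placeholders, not taken from the actual model), the computation of a single output neuron looks like this:

#include <cstdint>

// Sketch of the quantized data path of one output neuron: 8-bit inputs and
// weights, 32-bit accumulation, then requantization of the result to 8 bit.
int8_t quantized_neuron(const int8_t x[784], const int8_t w[784], int32_t bias) {
    int32_t acc = bias;                        // 32-bit accumulator
    for (int k = 0; k < 784; ++k) {
        acc += int32_t(x[k]) * int32_t(w[k]);  // int8 x int8 products, summed in int32
    }
    return int8_t(acc / 256);                  // placeholder scaling factor back to int8
}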
The network used:
Layer (type) Output Shape Param #
================================================================
dense_0 (Dense, ReLU) (None, 128) 100480
dense_1 (Dense, ReLU) (None, 256) 33024
dense_2 (Dense, ReLU) (None, 10) 2570
================================================================
Total params: 136,074
Trainable params: 136,074
Non-trainable params: 0
________________________________________________________________
The Python code to compute the inference of this network boils down to a vector-matrix multiplication followed by the activation function for each layer.
The deep learning framework TensorFlow is used for training the neural network. AMD-Xilinx Vitis AI is used for the FPGA-optimized quantization of the trained model.
The difference between FPGA-optimized quantization and typical integer quantization for microcontrollers is the scaling factor before the output. In the paper on the quantization technique of the TensorFlow Lite framework, the scaling factor is described as a fixed-point multiplication. In FPGAs, however, this factor takes the form of a division by a power of two (2^n), which can be implemented in hardware as a very cheap bit-shift operation.
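A minimal sketch of this requantization step, assuming a hypothetical shift value:

#include <cstdint>

// Requantization after a dense layer: division by 2^SHIFT as an arithmetic
// right shift, followed by saturation to the int8 range.
constexpr int SHIFT = 8;   // placeholder power-of-two scaling exponent

int8_t requantize(int32_t acc) {
    int32_t scaled = acc >> SHIFT;
    if (scaled > 127)  scaled = 127;
    if (scaled < -128) scaled = -128;
    return int8_t(scaled);
}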
The dump_model() function in Vitis AI is used to output the quantized weights and biases. It also outputs the network's output for a given input. With this information, the scaling factors can be calculated. Internally, there is more than one scaling factor in the inference implementation of Vitis AI (after Dense, BiasAdd and ReLU). Unfortunately, Vitis AI does not provide enough data to calculate them all. However, for this example, a single overall factor after the activation function is sufficient and no significant loss of accuracy is observed.
To write the weights and biases into C++ constants, a Python script was written.
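A generated header might look roughly like the following; the names, types and values here are hypothetical and not copied from the project's files (the actual constants are part of the sources added to the Vitis HLS project, e.g. src/mnist.h):

#include <cstdint>

// Hypothetical excerpt of a generated weight header (illustrative only).
static const int8_t  w0[784][128] = { /* quantized weights of dense_0 */ };
static const int32_t b0[128]      = { /* quantized biases of dense_0 */ };
static const int     shift0       = 8;   // hypothetical power-of-two scaling exponent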
High Level Synthesis

As seen in the fundamentals part of this post, the inference of a fully connected neural network is a vector-matrix multiplication with the ReLU activation function and the scaling factor.
The basics for the implementation of the matrix multiplication are from a great HLS tutorial (YouTube, GitHub) from the Scalable Parallel Computing Laboratory at ETH Zurich.
General Matrix Multiplication
First, the general matrix multiplication is implemented directly with three loops. With this structure, the synthesis tool will not be able to pipeline the innermost loop, because every iteration of the innermost loop reads and writes the acc variable. On hardware, this cannot be done within one clock cycle.
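A sketch of such a naive implementation, in the spirit of the tutorial code (the float type and the dimensions are illustrative, not the ones used in the project):

constexpr int N = 32, K = 32, M = 32;  // illustrative matrix dimensions

// Naive matrix multiplication C = A * B with three nested loops.
void matmul_naive(const float A[N][K], const float B[K][M], float C[N][M]) {
    for (int n = 0; n < N; ++n) {
        for (int m = 0; m < M; ++m) {
            float acc = 0.0f;
            for (int k = 0; k < K; ++k) {
                #pragma HLS PIPELINE II=1
                // Loop-carried dependence: acc is read and written in every
                // iteration of the innermost loop.
                acc += A[n][k] * B[k][m];
            }
            C[n][m] = acc;
        }
    }
}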
This loop-carried dependence can be resolved with a simple reordering of the iteration space. The accumulation is kept in a buffer arr, and the loops are reordered so that arr[m] is read and only written back again a full inner-loop pass later. Consecutive pipeline iterations therefore touch different elements, and no loop-carried dependence is generated.
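A sketch of the reordered loop nest, using the same illustrative types and dimensions as above:

// Reordered matrix multiplication: the k loop is moved outside the m loop, so
// consecutive iterations of the pipelined inner loop touch different elements
// of the accumulation buffer arr.
void matmul_reordered(const float A[N][K], const float B[K][M], float C[N][M]) {
    float arr[M];
    for (int n = 0; n < N; ++n) {
        for (int k = 0; k < K; ++k) {
            for (int m = 0; m < M; ++m) {
                #pragma HLS PIPELINE II=1
                float prev = (k == 0) ? 0.0f : arr[m];  // read arr[m] ...
                arr[m] = prev + A[n][k] * B[k][m];      // ... and write it back
            }
        }
        for (int m = 0; m < M; ++m) {
            #pragma HLS PIPELINE II=1
            C[n][m] = arr[m];  // write the finished row to the output
        }
    }
}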
Now we want some parallelism; there are two dimensions we can unroll. First, we can buffer multiple (D) rows of the A matrix and process them in parallel, which cuts the number of iterations of the outer loop by a factor of D. Second, we can process the data as vectors of width W, which cuts the number of iterations of the inner loop by a factor of W. In order to vectorize the design, a vector type Vec_t must be introduced.
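The following sketch only illustrates the effect of such vectorization; the project itself uses the Vec_t type from src/Datapack.h, while here a plain array and an UNROLL pragma stand in for it:

constexpr int W = 8;  // illustrative vector width

// One update step over a group of W values: with the loop unrolled, W
// multiply-accumulates happen in the same clock cycle, which is what packing
// W elements into the vector type Vec_t achieves. (For this to work, the
// arrays must be partitioned into registers where they are defined.)
void accumulate_vector(float a_nk, const float b[W], float arr[W], bool first) {
    for (int w = 0; w < W; ++w) {
        #pragma HLS UNROLL
        float prev = first ? 0.0f : arr[w];
        arr[w] = prev + a_nk * b[w];
    }
}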
To obtain a general dense layer function, several modifications are required. The goal is a fully parallel and pipelined HLS function.
Variable dimensions are generally problematic for a hardware implementation: one has to imagine how a synthesis tool should generate hardware for a loop with a variable number of iterations, or how to instantiate registers for an array of variable length. But for a dense layer, which consists only of vector-matrix multiplications, it is possible. A fully unrolled vector-matrix multiplication means in this case:
- N = 1, and therefore D = 1
- K = number of nodes from the previous layer (or input)
- M = number of nodes of the current layer
- W = M if fully parallel
With the pragma HLS LOOP_TRIPCOUNT one can tell the tool the expected number of iterations of loops with variable bounds, which helps it analyze and report on such otherwise unconstrained loops. In this vector-matrix multiplication, most of them are executed only once. Other modifications (see the sketch after this list):
- The scale factor is implemented as a bit-shift operation
- The ReLU activation function is implemented as a ternary operation
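A hedged sketch of what such a dense-layer function can look like; the data types, pragmas and the scaling exponent are illustrative and not copied from src/MLP.cpp:

#include <cstdint>

// K: nodes of the previous layer, M: nodes of the current layer,
// SHIFT: power-of-two scaling exponent of the layer.
template <int K, int M, int SHIFT>
void dense_relu(const int16_t in[K], const int16_t weights[K][M],
                const int32_t bias[M], int16_t out[M]) {
    int32_t acc[M];
    #pragma HLS ARRAY_PARTITION variable=acc complete
    // (the weights also need to be partitioned along M where they are defined)

Accumulate: // one input element per clock cycle, all M MACs in parallel
    for (int k = 0; k < K; ++k) {
        #pragma HLS PIPELINE II=1
        for (int m = 0; m < M; ++m) {
            #pragma HLS UNROLL
            int32_t prev = (k == 0) ? bias[m] : acc[m];
            acc[m] = prev + in[k] * weights[k][m];
        }
    }

Scale_ReLU: // scale factor as a bit shift, ReLU as a ternary operation
    for (int m = 0; m < M; ++m) {
        #pragma HLS UNROLL
        int32_t scaled = acc[m] >> SHIFT;                      // saturation omitted
        out[m] = (scaled > 0) ? int16_t(scaled) : int16_t(0);
    }
}

Since the dimensions here are compile-time constants, no LOOP_TRIPCOUNT pragma is needed; in the actual code it helps the tool report latency for loops whose bounds it cannot infer.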
The top function of the network consists of the input/output specification and is basically a function call for every layer in the network. The macro __SYNTHESIS__ can be used to isolate code that will not be synthesized. To eliminate the overhead of copying every image separately to the FPGA, all 10,000 test images are loaded into the programmable logic.
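A rough sketch of how such a top function can be structured, reusing the dense_relu template sketched above; the interfaces, weight constants and scaling exponents are placeholders, and the real implementation is the MultilayerPerceptron function in the project sources (presumably src/MLP.cpp):

// Placeholder weight and bias constants as they might be emitted by the
// Python script (values omitted; the real data is in the generated headers).
static const int16_t w0[784][128] = {}, w1[128][256] = {}, w2[256][10] = {};
static const int32_t b0[128] = {}, b1[256] = {}, b2[10] = {};

void MultilayerPerceptron(const int16_t images[10000][784], int16_t results[10000][10]) {
    #pragma HLS INTERFACE m_axi     port=images  bundle=gmem0
    #pragma HLS INTERFACE m_axi     port=results bundle=gmem1
    #pragma HLS INTERFACE s_axilite port=return

    // All 10000 test images are processed in one kernel invocation to avoid
    // the overhead of copying each image to the FPGA separately.
    for (int i = 0; i < 10000; ++i) {
        int16_t a0[128], a1[256];
        dense_relu<784, 128, 8>(images[i], w0, b0, a0);   // dense_0 + ReLU
        dense_relu<128, 256, 8>(a0, w1, b1, a1);          // dense_1 + ReLU
        dense_relu<256, 10, 8>(a1, w2, b2, results[i]);   // dense_2
    }
}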
Because Vec_t has a fixed length of 256, the weight matrices of layer 1 (784x128) and layer 3 (256x10) must be padded to a width of 256.
Bandwidth

With this configuration, Vitis HLS reaches its limit with the internal bandwidth. The original Vitis AI quantization keeps 32-bit integers after the multiply-accumulate operation in the dense layer. With a vector width of 256 and 32-bit integers, a total vector width of 8192 bits results, which is quite a lot. Therefore, the quantization factors were chosen so that a 16-bit integer precision can be used.
Board Setup

This part follows the GitHub repository Xilinx/Kria-PYNQ.
- Download the Ubuntu Image (We used Ubuntu Desktop 20.04.3 LTS)
- Flash Ubuntu Image to a SD-Card with balenaEtcher
- Insert the microSD card containing the boot image in the microSD card slot (J11) on the Starter Kit.
- Connect the micro USB end to J4 on the Starter Kit.
- Connect the Ethernet cable for required internet access.
- Grab the Power Supply and connect it to the DC Jack (J12) on the Starter Kit.
- Boot the FPGA
- Connect to FPGA with a serial terminal program of your choice (First of the two Serial Ports)
- User name: ubuntu, password: ubuntu (user must change password after first boot)
- Baud rate = 115200
- Data bits = 8
- Stop bits = 1
- Flow control = None
- Parity = None
Run the following commands to install the environment:
sudo snap install xlnx-config --classic --channel=1.x
xlnx-config.sysinit
git clone https://github.com/Xilinx/Kria-PYNQ.git
cd Kria-PYNQ/
sudo bash install.sh
This can take a while, around 30 minutes. Afterwards, one can connect to JupyterLab via a web browser at kria:9090/lab; the password is xilinx.
In the Jupyter environment, open a terminal and run the following commands:
cd home/root/jupyter_notebooks
git clone https://github.com/Nunigan/MNIST_HLS.git
cd MNIST_HLS
The folder MNIST_HLS contains all the pre-built data you need.
Vitis HLS

Clone the GitHub repository with all the code to your host PC where Vitis HLS and Vivado are installed. Prebuilt data is also included in the repo.
- Create a Vitis HLS project
- Add src/MLP.cpp, src/MLP.h, src/mnist.h and src/Datapack.h to the source files
- Add src/MLP_tb.cpp to the testbench files
- Select the xck26-sfvc784-2LV-c part
- Use the Vivado IP flow target
- Under Project Settings > Synthesis, choose MultilayerPerceptron as the top function
- Run the C simulation and check the console for errors
If the proposed network can be fully parallelized, each layer consumes one input element per clock cycle, so the theoretical number of cycles to compute one inference would be 784 + 128 + 256 = 1168.
When synthesis is run, a report similar to the following should be generated. One can see that the generated RTL needs 1287 clock cycles to compute the inference of one image. The difference from the theoretical minimum is due to the overhead of the function calls and data transfers.
The RTL can then be exported to a designated location.
Create a new Vivado project and select the Kria KV260 board. Under Settings/IP/Repository, add the path to the IP block exported and unpacked from Vitis HLS. Create a block design with a Zynq UltraScale+ MPSoC and the HLS block. Run block automation.
Set the clock frequency for the HLS block to 250 MHz.
Select an AXI high-speed slave interface with a data width of 64 bits.
Run block connection automation to connect everything.
After that, run the bitstream generation. Export the bitstream, the block design (block design must be open for that option) and copy the hardware handoff from this location:
*.gen/sources_1/bd/design_1/hw_handoff/design_1.hwh
These three files are needed to run with PYNQ.
Pynq

- Connect to kria:9090/lab over a browser
- Create a folder and add the pynq/ folder from the GitHub repository
The notebook mnist.ipynb shows how easily an HLS block can be tested:

- Load the bit file
- Overclock the system as far as the results remain correct
- Print the register map; the name MultilayerPerceptron_0 can be seen under the block properties in Vivado
- Allocate buffer space for the input and output AXI buses
- Load all the necessary data (weights and biases are only needed for the software implementation)
- Define the function for the hardware-accelerated MNIST inference
- Define the function for the software implementation
- Check whether both functions generate the same output
- Compute the performance gain and the frames per second (FPS)
The notebook dpu_mnist.ipynb runs the same fully connected neural network (mnist.xmodel) on the DPU. The performance is about 4000 FPS.
A quick comparison with my VHDL implementation of the same neural network shows that HLS has a lower hardware utilization. In one respect, however, the HDL implementation is better: HLS does not reuse the DSPs across the different layers. It instantiates separate DSPs for each layer (128 + 256 + 10 = 394), while my VHDL implementation reuses the same DSPs for every layer.
This project shows how simple neural networks can be accelerated with Vitis HLS. Only the implementation for a small network is shown; for larger networks, adjustments to the code need to be made.
If one wants to recreate the quantization and file generation, the Python script src/utils.py provides everything. Vitis AI must be installed on the host PC.
The PYNQ code is inspired by zst123's blog post about a SHA256 Crypto Accelerator with PYNQ & Vitis HLS.