Published March 21, 2020 © GPL3+

BNN-PYNQ: Baking a custom BNN for the Zybo-Z7

Training and synthesizing a BNN for a very small Zynq SoC.

Intermediate · Full instructions provided · 4 hours

Things used in this project

Hardware components

Digilent Zybo Z7-10

Software apps and online services

AMD Vivado Design Suite

Xilinx PYNQ

Story

Introduction

PYNQ is a Linux-based OS for Xilinx Zynq-7000 series SoCs that provides a Python framework for integrating FPGA IP-based solutions. BNN-PYNQ is a repository containing several hardware-based Binary Neural Network (BNN) examples that run on PYNQ and achieve a high frame rate.

Unfortunately, this repository targets the bigger PYNQ-Z1/Z2 boards and is not compatible with the small FPGA in the Zybo/Zybo-Z7-10.

Since this board is quite cheap and popular, we will leverage the BNN-PYNQ framework to train and synthesize a smaller binary neural network model in the form of a PYNQ overlay (gateware + extra config files) that can fit in the PL of the Zynq SoC.

We will synthesize a 3-layer fully connected network with 256 neurons per layer, called SFC, which was presented in the Xilinx FINN paper but never released in the BNN-PYNQ repo. This write-up can also be used as a tutorial for synthesizing custom BNN overlays for custom boards.

This project was conducted in collaboration with Syed Talal Wasim as part of our Tutored Research and Development Project on Neural Network inference on FPGAs for the Erasmus Joint Master Degree on Image Processing and Computer Vision.

Getting Started

To get started we will need the following:

Zybo Z7-10 PYNQ Image (I have Vivado 2018.3, so I used v2.4)
Clone of the Xilinx BNN-PYNQ repository (we will use a forked version, BNN-PYNQ-ZYBO, which includes the smaller network files).

The extended BNN-PYNQ-ZYBO repository already includes the libraries and Python wrapper needed to test the SFC on the Zybo-Z7.

NOTE: If you have a Zybo-Z7-10 and want to skip the details of how the network is implemented, you can test the BNN directly by installing the BNN-PYNQ-ZYBO package on your board and following the instructions in Section 5 of this post.

To summarize what we are going to do, we will split this guide into several parts:

Part 1: Training a custom BNN using Theano + Binarynet.
Part 2: Weight and threshold packing.
Part 3: HLS & Vivado Synthesis
Part 4: Overlay PYNQ Integration
Part 5: Testing the SFC on the Zybo-Z7

Let’s get started!

Section 1: Training a custom BNN using Theano + Binarynet.

Goal: Define the NN architecture, train for specific dataset.

The Xilinx BNN-PYNQ repository provides Theano-compatible libraries to train binarized, fully connected or convolutional neural networks. The library is based on binarynet, and it is used to train pre-binarized neural networks that can afterwards be deployed onto the FPGA.

In a nutshell, a fully binarized neural network operates using 1-bit weights and inputs. It replaces the good ol' vector inner products (multiply and accumulate) with popcounts and thresholds that ultimately aim to mimic the behaviour of fully connected neurons with a tanh-ish activation function. By using the Xilinx FINN library, once deployed on the FPGA, all layers can run concurrently to achieve high classification rates.
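To make the popcount idea concrete, here is a minimal pure-Python sketch (not code from the repository) showing that an inner product over values in {-1, +1} can be recovered from an XNOR followed by a popcount, and that a sign activation then becomes a simple threshold on that popcount. The helper names and the random test vectors are made up for illustration.

```python
import random

def dot_bipolar(x, w):
    """Reference: ordinary inner product over values in {-1, +1}."""
    return sum(xi * wi for xi, wi in zip(x, w))

def to_bits(v):
    """Encode -1 as bit 0 and +1 as bit 1."""
    return [1 if e > 0 else 0 for e in v]

def xnor_popcount(x_bits, w_bits):
    """Count the positions where input bit and weight bit agree (XNOR, then popcount)."""
    return sum(1 for xb, wb in zip(x_bits, w_bits) if xb == wb)

random.seed(0)
n = 256  # fan-in of one hidden SFC neuron
x = [random.choice((-1, 1)) for _ in range(n)]
w = [random.choice((-1, 1)) for _ in range(n)]

p = xnor_popcount(to_bits(x), to_bits(w))

# Each agreeing position contributes +1 and each disagreeing one -1,
# so the inner product equals 2 * popcount - n.
assert dot_bipolar(x, w) == 2 * p - n

# A sign activation on the inner product is therefore a threshold on the popcount.
out_reference = 1 if dot_bipolar(x, w) >= 0 else 0
out_popcount = 1 if p >= n / 2 else 0
assert out_reference == out_popcount
```

In hardware the comparison against the threshold is all that remains of the activation function, which is why no multipliers are needed for the hidden layers.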

To train a network model, in this case our small Zybo-Z7 compatible SFC, let’s clone the repo to our local computer and look at the ./xtras/ directory of the repository:

// bnn_weight_test.ipynb	Utility to inspect BNN parameter
//                              distribution.
// train_SFC_colab.ipynb	SFC BNN train notebook 
//                              [For Google Colab]

Use this notebook as a baseline to generate the environment to train a custom model. Let’s take a look at the SFC network definition and training scripts.

Take a look at the ./bnn/src/training/ directory of the repo. Let’s see the relevant files here:

// lfc.py			LFC BNN architecture definition.
//                              (used as a template)
// mnist.py			LFC training script.
//                              (used as a template)
// sfc.py			SFC BNN architecture definition.
// sfc_train_mnist.py		SFC training script.
// sfc_mnist-1w-1a.npz        	Pickle with pre-trained weights
//                              for SFC - MNIST.
// mnist-gen-weights-W1A1.py    Weight/threshold packing script.
//                              (used as a template)
// sfc_mnist-gen-weights.py	Weight/threshold packing script.
// finnthesizer.py		Weight/threshold packer utility.

To define the SFC network, I modified the original BNN-PYNQ lfc.py and adjusted the number of nodes per layer from 1024 to 256. This process is shown in the next image:

The number of nodes and layers can be defined using sfc.py or lfc.py as a template.
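To get a feeling for how much smaller SFC is than LFC, the sketch below counts the weights of both networks. It assumes three hidden layers followed by a 10-class output layer for both models (consistent with the four layers L0 to L3 listed in config.h in Section 2) and ignores batch normalization parameters; the function name is made up for illustration.

```python
def fc_weight_count(layer_sizes):
    """Number of weights in a fully connected stack, ignoring batch-norm parameters."""
    return sum(n_in * n_out for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

mnist_inputs, mnist_classes = 28 * 28, 10

# Assumed topologies: three hidden layers plus the output layer.
sfc = [mnist_inputs, 256, 256, 256, mnist_classes]
lfc = [mnist_inputs, 1024, 1024, 1024, mnist_classes]

sfc_weights = fc_weight_count(sfc)
lfc_weights = fc_weight_count(lfc)

print(sfc_weights)  # 334336
print(lfc_weights)  # 2910208
```

Under these assumptions SFC needs roughly one ninth of the weights of LFC, which is what makes it a candidate for the smaller FPGA.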

Furthermore, I modified mnist.py into sfc_train_mnist.py to instantiate and train the SFC model. You should modify this file if you want to train on a dataset other than MNIST.

Let’s copy ./xtras/train_SFC_colab.ipynb to Google Colab and run it. It will automatically fetch the SFC repo and the correct libraries to train the model for 1000 epochs (patience required). The best epoch will be saved to a pickled file called sfc_mnist-1w-1a.npz (trained weights for SFC and MNIST are provided in the repo).

Patience is also a form of action O.O .

Even though the trained network is a BNN, its layer representation uses floating point, as it encompasses the layers’ weights and batch normalization parameters. If you feel like it, check the FINN paper to see how the threshold is calculated.

Although it is not mandatory, you can use the notebook located in ./xtras/bnn_weight_test.ipynb to inspect the pickle file; you will see the weight and batchnorm parameters stored sequentially for each layer.
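As a rough illustration of the threshold idea, the sketch below folds a batch normalization followed by a sign activation into a single integer threshold on the popcount. This is a simplified derivation, not the finnthesizer code: it assumes a positive batch-norm scale (gamma > 0), and the statistics, the epsilon value and the function names are made up.

```python
import math

def bn_sign(a, gamma, beta, mean, var, eps=1e-4):
    """Batch normalization followed by a sign activation, returned as a bit."""
    y = gamma * (a - mean) / math.sqrt(var + eps) + beta
    return 1 if y >= 0 else 0

def popcount_threshold(fan_in, gamma, beta, mean, var, eps=1e-4):
    """Integer threshold on the popcount p, where a = 2 * p - fan_in (assumes gamma > 0)."""
    tau = mean - beta * math.sqrt(var + eps) / gamma  # threshold on the pre-activation a
    return math.ceil((tau + fan_in) / 2)              # same threshold expressed on p

fan_in = 256
gamma, beta, mean, var = 0.8, 0.3, 5.0, 40.0  # made-up batch-norm statistics
t = popcount_threshold(fan_in, gamma, beta, mean, var)

# The thresholded popcount agrees with batch norm + sign for every possible popcount.
for p in range(fan_in + 1):
    a = 2 * p - fan_in
    assert bn_sign(a, gamma, beta, mean, var) == (1 if p >= t else 0)
```

A negative gamma would flip the direction of the comparison, which this sketch does not handle.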

Pickle arrangement of layers’ properties (including output layer)
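If you prefer a quick look outside the notebook, a minimal NumPy sketch like the following lists what is stored in the file. It assumes the .npz archive holds plain numeric arrays (so no pickled objects need to be loaded); the function name is made up and the array names depend on how the training script saved them.

```python
import numpy as np

def summarize_npz(path):
    """Return (name, shape, dtype) for every array stored in a .npz archive."""
    with np.load(path) as archive:
        return [(name, archive[name].shape, str(archive[name].dtype))
                for name in archive.files]
```

Calling summarize_npz("sfc_mnist-1w-1a.npz") from ./bnn/src/training/ and printing the result should show the per-layer sequence of weight and batchnorm arrays.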

So we have the pickled layers of our BNN. In the next step we will pack them into an arrangement of actual binary weights and integer thresholds.

Section 2. Weight/threshold packing

Goal: Define the folding parameters of the network. Get the discretized weights. Synthesize the MVTUs config.h file. Set the main.cpp for HLS synthesis.

All BNN-PYNQ implementations presented in the repo leverage the FINN library, which instantiates a special kind of binary processor called the Matrix-Vector-Threshold Unit (MVTU). There is one MVTU per layer, working in a dataflow scheme so that all layers can be computed concurrently.

In this step we will define the actual sizes of the MVTUs and run the weight/threshold packing script on our computer, while taking the forward propagation latency into consideration. You need to take into account that:

A bigger MVTU computes a layer in fewer clock cycles but demands more FPGA computing resources.
All the thresholds and weights, as well as the activations between layers, are stored in on-chip block RAMs so that they can be accessed very quickly. There is no way around this: even with small MVTUs, all parameters have to be loaded into the IP beforehand.
Smaller MVTUs have a small memory overhead since more intermediate results need to be stored.

For a fully connected layer, the MVTU’s width and height are defined through the simdCounts and peCounts parameters respectively, which are displayed in the next image. You can open ./bnn/src/training/sfc_mnist-gen-weights.py and check out their configuration.

Using this file you can modify the physical parameters of the MVTUs that ultimately build up our hardware-based BNN. This is what is called design space exploration of an FPGA architecture. Sounds fancy!

Weight and threshold packing script.
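A tiny design space exploration can be sketched in a few lines of Python. The snippet below enumerates every (SIMD, PE) pair that divides a layer evenly and reaches a given cycle count, assuming the number of cycles per layer is (MatW / SIMD) * (MatH / PE). That formula is inferred from the config.h figures shown later in this section, and the function names are made up for illustration.

```python
def divisors(n):
    """All positive divisors of n, in ascending order."""
    return [d for d in range(1, n + 1) if n % d == 0]

def foldings(mat_w, mat_h, target_cycles):
    """All (SIMD, PE) pairs that divide the layer evenly and hit the target cycle count."""
    return [(simd, pe)
            for simd in divisors(mat_w)
            for pe in divisors(mat_h)
            if (mat_w // simd) * (mat_h // pe) == target_cycles]

# Options for a 256x256 hidden layer that should take 512 cycles.
options = foldings(256, 256, 512)
print(options)
# [(1, 128), (2, 64), (4, 32), (8, 16), (16, 8), (32, 4), (64, 2), (128, 1)]
```

All of these options use the same number of XNOR lanes (SIMD * PE = 128); they differ in how the work is split between input parallelism and neuron parallelism.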

Let’s analyze a few lines of this script. The npzFile string contains the path to the pickle file you generated, or the one provided with the repo. Also, output path should match your NN design name (sfcw1a1 in my case).

Now, let’s see the next few lines. They allow us to modify the physical parameters of the BNN. Every variable is a 4-element vector that defines specific properties of each layer (including the output layer).

The parameters simdCounts and peCounts define the physical parameters of the MVTUs. To avoid padding, each value should be a divisor of the corresponding layer dimension (the number of inputs for simdCounts, the number of neurons for peCounts).

Finally, you can play with the input, weights and activation precisions of our network.

OK, let’s run the script on our computer. You have to transfer the generated pickle file to your local ./bnn/src/training/ and run the script with Python 2. You will see that the weight packer outputs the layers’ characteristics and generates a sfcW1A1 directory in ./bnn/src/training/.

You can also choose to run the packer script in Colab, but you will then have to transfer the whole generated sfcW1A1 directory to ./bnn/src/training/.

Running the weight packer on my computer.

Let’s check out the new directory ./bnn/src/training/sfcW1A1. You can see the thresholds and weights packed in a format that is specific to the network you are creating. At some point, you will have to copy those weights to the Zybo Z7. Before that, you will use them to run the C testbench of our BNN prior to HLS synthesis.

The ./bnn/src/training/sfcW1A1/hw directory contains a crucial file called config.h. This file is the spatial configuration of our BNN and, among other parameters, reports the latency of each layer. You have to balance simdCounts and peCounts in the packing script so that all layers have comparable latency.

// Fully-Connected Layer L0:
//     MatW =   832 MatH =   256
//     SIMD =    32  PE  =    16
//     WMEM =   416 TMEM =    16
//     //Ops  = 425984   Ext Latency  =   416

// Fully-Connected Layer L1:
//     MatW =   256 MatH =   256
//     SIMD =     8  PE  =    16
//     WMEM =   512 TMEM =    16
//     //Ops  = 131072   Ext Latency  =   512

// Fully-Connected Layer L2:
//     MatW =   256 MatH =   256
//     SIMD =    16  PE  =     8
//     WMEM =   512 TMEM =    32
//     //Ops  = 131072   Ext Latency  =   512

// Fully-Connected Layer L3:
//     MatW =   256 MatH =    64
//     SIMD =     4  PE  =     8
//     WMEM =   512 TMEM =     8
//     //Ops  = 32768   Ext Latency  =   512

// Extracted config.h summary from SFC.
// Note that the Latency of each layer is comparable.

Let's look at the layer properties. The input layer L0 of the neural network expects 784 entries (28x28 for an MNIST character) in MatW. Due to the AXI interface, this number is padded to a multiple of 64, making it 832. For the same reason, the MatH of the L3 output layer is also padded from 10 to 64.
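The figures in the config.h summary can be reproduced with a little arithmetic. The sketch below is not the packer's code: the formulas are inferred from the numbers in the listing above (they match all four layers), and the function names are made up for illustration.

```python
def pad_to(n, multiple=64):
    """Round n up to the next multiple (the listing uses multiples of 64)."""
    return -(-n // multiple) * multiple

def mvtu_summary(mat_w, mat_h, simd, pe):
    """Recompute the per-layer figures reported in config.h."""
    wmem = (mat_w // simd) * (mat_h // pe)  # weight memory depth per PE
    tmem = mat_h // pe                      # threshold memory depth per PE
    ops = 2 * mat_w * mat_h                 # one multiply and one add per weight
    return {"WMEM": wmem, "TMEM": tmem, "Ops": ops, "Latency": wmem}

# Padding of the MNIST input width and the 10-class output height.
assert pad_to(28 * 28) == 832 and pad_to(10) == 64

# (MatW, MatH, SIMD, PE) for L0..L3, taken from the listing above.
layers = [(832, 256, 32, 16), (256, 256, 8, 16), (256, 256, 16, 8), (256, 64, 4, 8)]
summaries = [mvtu_summary(*layer) for layer in layers]
for i, summary in enumerate(summaries):
    print("L%d" % i, summary)
```

The reported Ext Latency equals WMEM for every layer, which is why balancing SIMD and PE across layers keeps the latencies between 416 and 512 cycles.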

Now that we have the BNN architecture configuration and the weights packed, we need to make two new directories in the repo. (They will be already there if we are just synthesizing SFC).

// sfcW1A1 		in bnn/src/network/
// sfcW1A1		in bnn/params/mnist/

We will copy some files to those directories:

// ./bnn/src/training/sfcW1A1/hw/config.h to 
// ./bnn/src/network/sfcW1A1/hw/

// Then copy from LFC network:
// ./bnn/params/mnist/lfcw1a1/hw/top.cpp to 
// ./bnn/src/network/sfcW1A1/hw/

The file top.cpp defines the top-level HLS module to synthesize.

// Also from LFC network:
// ./bnn/params/mnist/lfcw1a1/sw/main_python.cpp to
// ./bnn/src/network/sfcW1A1/sw/.

The file main_python.cpp defines the helper library you need to compile for the overlay to interact with PYNQ.

NOTE: If you are using a three layer fully connected network (like SFC), then there’s no need to modify top.cpp nor main_python.cpp. If the layer arrangement is different, those changes need to be accounted for in these two files.

Finally, copy all packed weight and threshold files so that they are available for the HLS synthesizer script.

// Copy weights and thresholds.
// ./bnn/src/training/sfcW1A1/*.bin to ./bnn/params/mnist/sfcW1A1/
// ./bnn/src/training/sfcW1A1/*.txt to ./bnn/params/mnist/sfcW1A1/
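If you repeat this step often, the parameter copy can be scripted. The following is a minimal sketch using only the Python standard library; the directory layout follows the paths given above, while the function name and its arguments are made up for illustration.

```python
import shutil
from pathlib import Path

def stage_params(repo_root, network="sfcW1A1", dataset="mnist"):
    """Copy packed *.bin and *.txt files from the training output to bnn/params."""
    src = Path(repo_root) / "bnn" / "src" / "training" / network
    dst = Path(repo_root) / "bnn" / "params" / dataset / network
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for pattern in ("*.bin", "*.txt"):
        for f in sorted(src.glob(pattern)):
            shutil.copy2(f, dst / f.name)  # keeps timestamps, overwrites older copies
            copied.append(f.name)
    return copied
```

Running stage_params(".") from the root of the cloned repository would copy the files and return their names, which makes it easy to confirm that nothing was missed.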

It seems we are done over here! We have packed the weights and generated the config.h physical parameter file of our BNN. Next step is to prepare our design for HLS synthesis.

Section 3. HLS & Vivado Synthesis.

Goal: Write the .tcl files to target our board. Run synthesis to get the PYNQ overlay.

The process of HLS and Vivado synthesis is done by running a utility called make-hw.sh that is located in the repo's directory ./bnn/src/network/.

To synthesize the SFC overlay files for the Zybo-Z7, you just need to run two commands in bash.

# Run Vivado settings64.sh.
# Use here your Vivado installation path. In my case this is:

source /tools/Xilinx/Vivado/2018.3/settings64.sh

# Call the synthesis script (“a” option does HLS & Vivado Synthesis).

./make-hw.sh sfcW1A1 Zybo-Z7 a

The overlay and associated files will be created and copied to ./bnn/src/network/output/bitstream/.

However, if you are training a different neural network, you will need to modify ./bnn/src/network/make-hw.sh so that it uses a matching test input and expected result for the C testbench that runs before HLS synthesis. In the script you can see that a C simulation of our NN is run using an MNIST picture of the character 3.

If you are not targeting the Zybo-Z7, you also need to set the correct FPGA part for your board in make-hw.sh.

make-hw.sh: Platform part and C testbench settings defined for SFC and Zybo-Z7.

If you are testing a different network with an already supported board, then you are done. If you need to support a different board, you will need to modify more files.

Let’s create a directory called Zybo-Z7 in bnn/src/library/script/. You need to put three files inside.

// make-vivado-proj.tcl	    Vivado project script maker
// Zybo-Z7.tcl		    Standard PYNQ configuration loader.
// Zybo-Z7.xdc		    General .xdc file for the Zybo Z7. Sourced from
//                          https://github.com/Digilent/digilent-xdc

For the file make-vivado-proj.tcl, I copied the original file for the Pynq-Z1 board from ./bnn/src/library/script/pynqZ1-Z2/ and modified the lines that refer to the FPGA part and the board name.

As for Zybo-Z7.tcl (you should name it your_board_name.tcl), I used the file ./bnn/src/library/script/pynqZ1-Z2/pynqZ1-Z2.tcl as a template, removing every CONFIG line from the original file and replacing them with the ones present in the configuration of a standalone Vivado project targeted at my board. You can use the .tcl configuration you used to build your PYNQ distribution.

The main risk here is that the Pynq-Z1 and Zybo-Z7 boards have different RAM and AXI port mappings, so an overlay built with a wrong configuration will not work even if it synthesizes without errors.

Zybo-Z7.tcl: I kept the tcl process definitions but replaced all of the CONFIG lines.

Having created those files, you are ready to proceed with the synthesis for your custom board. Now, make-hw.sh can be executed.

Running the C simulation prior to HLS synthesis.

Take a look at the HLS synthesis report. The memory requirements of SFC fit the Zybo-Z7! This is our primary constraint. The LUT count in this report is only an estimate and can be ignored at this stage.

HLS synthesis report.
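For a rough feeling of why memory is the primary constraint, the sketch below estimates a lower bound on the block RAM needed for the binary weights alone. It uses the padded layer dimensions from the config.h summary in Section 2 and the 36 Kb capacity of a 7-series block RAM tile; it ignores thresholds, stream buffers and the way memories are partitioned per PE, so the real usage is higher.

```python
BRAM_TILE_BITS = 36 * 1024  # one 7-series block RAM tile holds 36 Kb

# Padded (MatW, MatH) of layers L0..L3, from the config.h summary in Section 2.
layer_shapes = [(832, 256), (256, 256), (256, 256), (256, 64)]

weight_bits = sum(w * h for w, h in layer_shapes)  # one bit per binary weight
min_tiles = weight_bits / BRAM_TILE_BITS

print(weight_bits)          # 360448
print(round(min_tiles, 2))  # 9.78
```

The Vivado utilization report in this section shows 30 of 60 tiles in use; the gap to this lower bound presumably comes from the overheads that the estimate leaves out.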

To fit the design in the PL, a heavy synthesis and place & route optimization is done in Vivado, as you can see in the next report extract.

// +----------------------------+-------+-------+-----------+-------+
// |          Site Type         |  Used | Fixed | Available | Util% |
// +----------------------------+-------+-------+-----------+-------+
// | Slice LUTs                 |  8174 |     0 |     17600 | 46.44 |
// | Slice Registers            | 13126 |     0 |     35200 | 37.29 |
// | F7 Muxes                   |   194 |     0 |      8800 |  2.20 |
// | F8 Muxes                   |    16 |     0 |      4400 |  0.36 |
// | Block RAM Tile             |    30 |     0 |        60 | 50.00 |
// | DSPs                       |     6 |     0 |        80 |  7.50 |
// +----------------------------+-------+-------+-----------+-------+

After synthesis, both HLS and Vivado synthesis reports are available in ./bnn/src/network/output/report/sfcW1A1-Zybo-Z7/.

The overlay's bitstream and related files (.tcl and .hwh) can be found in ./bnn/src/network/output/bitstream/. Let's list this directory:

// sfcW1A1-Zybo-Z7.bit        PL Bitstream.
// sfcW1A1-Zybo-Z7.hwh        Hardware definition file.
// sfcW1A1-Zybo-Z7.tcl        Hardware configuration file.

Together, these three files make up the PYNQ overlay for our SFC model. Now, you can proceed with the PYNQ integration step.

Section 4. SFC PYNQ Integration

Goal: Integrate the SFC model and the Zybo-Z7 platform into the BNN-PYNQ module to perform inference.

First, the bnn package must be installed. Follow the instructions in the BNN-PYNQ repository to install the Python package on your Zybo-Z7 running the PYNQ Linux distribution. Afterwards, some modifications are needed for the package to support the board.

Locate your BNN-PYNQ Python installation directory and go to the ./bnn/src/network directory. On my board running PYNQ 2.4, the bnn package directory is: /usr/local/lib/python3.6/dist-packages/bnn/.

Copy the three SFC overlay files from Section 3 to the Zybo-Z7 directory ./bnn/bitstreams/Zybo-Z7/ using scp.

Overlay loaded on Zybo-Z7.

Also, copy the packed weights and thresholds from Section 2 to the Zybo-Z7 directory ./bnn/params/mnist/sfcW1A1.

Weights and thresholds loaded on the Zybo-Z7.

To integrate the bnn Python wrapper with the Zybo-Z7 board, the files ./bnn/bnn.py and ./bnn/__init__.py need to be modified. Allow bnn.py to recognize your board and network. Also, remember to add a reference to your network in __init__.py as an available parameter (NETWORK_SFCW1A1 in my case).

bnn.py: Integrating the SFC and the Zybo-Z7 board.

One more step to go! The helper library needs to be compiled on your board. You need to transfer the files main_python.cpp and config.h that define the characteristics of our NN model into PYNQ to compile the helper library.

To do so, copy the directories hw and sw you created in Section 2 of this post to the ./bnn/src/network/sfcW1A1 directory of the Zybo-Z7.

Before the library can be compiled, you should modify the file make-sw.sh in ./bnn/src/network/ to account for the Zybo-Z7.

make-sw.sh: Modified to accept the Zybo-Z7.

Afterwards, ssh to the Zybo-Z7, go to ./bnn/src/network and run:

./make-sw.sh sfcW1A1 python_hw

After completion, the python_hw-sfcW1A1-Zybo-Z7.so library gets compiled on the board at ./bnn/src/network/output/sw/. Move it to a new directory at ./bnn/libraries/Zybo-Z7/.

Copying the compiled library to its working location.

Finally, with the library in place and the python wrapper modified you should be good to go!

Section 5. Testing the SFC on the Zybo-Z7

Goal: Test the SFC network in terms of accuracy and speed.

Note: This section is specific to the SFC network trained for MNIST. Different networks or training sets need different notebooks.

After installing the bnn package, connect to the PYNQ Jupyter notebook using your browser.

PYNQ's Jupyter notebook "welcome" screen on my Zybo-Z7. Let's check out the bnn directory.

Check out the bnn directory, open SFCW1A1_MNIST.ipynb and run it!

The Zybo-Z7 may be small, but it sure is fast at classifying MNIST!

We have 95.71% accuracy at 5.12 µs per image.
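As a sanity check on this figure, the sketch below converts a per-image cycle count into time and throughput. The 512-cycle count is the largest Ext Latency in the config.h summary from Section 2; the 100 MHz clock is an assumption, since this post does not state the fabric clock frequency, and the function names are made up for illustration.

```python
def time_per_image_us(cycles, clock_hz):
    """Time one image spends in a pipeline stage that needs `cycles` clock cycles."""
    return cycles * 1e6 / clock_hz

def images_per_second(cycles, clock_hz):
    """Steady-state throughput when the slowest stage bounds the dataflow pipeline."""
    return clock_hz / cycles

slowest_layer_cycles = 512  # largest Ext Latency in the config.h summary
assumed_clock_hz = 100e6    # assumption: 100 MHz, not stated in this post

print(time_per_image_us(slowest_layer_cycles, assumed_clock_hz))  # 5.12
print(images_per_second(slowest_layer_cycles, assumed_clock_hz))  # 195312.5
```

Under that clock assumption the slowest layer alone accounts for the measured 5.12 µs per image, which would correspond to roughly 195 thousand images per second.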

Congrats! We have implemented and tested a BNN on the small Zybo-Z7-10.