The story started last year during the MSc course "Supercomputing for Big Data" we give at Delft University of Technology in the Netherlands. Each year, this course attracts about 100 highly motivated students interested in learning about data engineering. With the growing interest in high-performance analytics and hardware acceleration, we wanted to add an FPGA acceleration component to the course to introduce the students to this new and exciting field, which enables fast and efficient analytics.
However, the students come from diverse educational backgrounds, ranging from Computer Engineering and Computer Science to Aerospace Engineering and Technology Management. It is therefore not feasible to ask students to write low-level C programs, let alone create hardware designs in VHDL. We needed a toolchain that allows students to write big data programs in high-level languages like Python while automatically using existing accelerator libraries on FPGAs.
Pieces of the puzzle
Until last year, this seemed like a distant dream, but a number of recent developments have made such integration possible. The first development happened two years ago with the availability of data-center grade FPGA cards such as Alveo in the cloud. The second happened last year, with the development of extensive ready-to-use accelerator libraries such as Vitis and Vitis AI. The third development happened just a couple of months ago, with the integration of the Pynq framework into Alveo cards, allowing Python code to run on high-end FPGAs.
In the following, our student Shashank Aggarwal shows how he integrated all these pieces of the puzzle together with the Dask multi-node Python scheduler to run scalable high-performance kernels on a cluster of FPGA accelerators. With this, he enabled many alternatives of transparent FPGA scalability, such as:
- Scaling on a local FPGA cluster as well as in the cloud (Nimbix & AWS)
- Scaling with many hardware kernel sources like Vitis libraries, Pynq libraries, FINN generated machine learning kernels, and manually designed kernels
- Scaling on multiple Alveo U50 cards, on a mix of FPGA cards, and even on multiple Pynq cards
The code for these alternatives, along with more detailed information can be found on the OctoRay GitHub page at
https://github.com/shashank-agg/octoray
Alternative 1: Scaling Vitis kernels in the cloud
Here, we demonstrate using Dask (https://github.com/dask/dask) to parallelize data analytics on multiple U50 FPGAs in the Nimbix cloud. The idea is to use Dask to split the input data into multiple chunks and perform the computation in parallel on the FPGAs. The results are then combined into a single output file. The steps involved in parallelizing any task are shown in the workflow figure above and discussed below.
Step 1. A Dask client reads the input data (from a file, network, etc).
Step 2. The client detects the number of workers in the cluster. It splits the input data and distributes the chunks to the workers.
Step 3. Each worker uses a Python Driver (a Pynq Overlay/custom driver) to send the data to the FPGA, and waits for the results. The results are then returned to the client.
Step 4. The client, after receiving all the results, combines them and creates a single output file.
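The four steps above can be sketched with Dask's distributed API. This is a minimal, self-contained sketch: it runs a local cluster with in-process workers as a stand-in for remote FPGA workers, and `compress_chunk` uses software gzip as a placeholder for the real driver call, which would hand each chunk to the FPGA via a Pynq overlay.

```python
import gzip
from dask.distributed import Client, LocalCluster

def compress_chunk(chunk: bytes) -> bytes:
    # Step 3 stand-in: a real worker would pass the chunk to the FPGA
    # driver (a Pynq overlay) and wait for the compressed result.
    return gzip.compress(chunk)

# Stand-in for a remote FPGA cluster: two in-process workers.
cluster = LocalCluster(n_workers=2, processes=False)
client = Client(cluster)

data = b"some input data " * 4096                # Step 1: read the input data
n = len(client.scheduler_info()["workers"])      # Step 2: detect the worker count
size = -(-len(data) // n)                        # ceiling division for chunk size
chunks = [data[i:i + size] for i in range(0, len(data), size)]

futures = client.map(compress_chunk, chunks)     # Steps 2-3: distribute and run
results = client.gather(futures)                 # Step 4: collect the results...
output = b"".join(results)                       # ...and combine into one output

client.close()
cluster.close()
```

Swapping `LocalCluster` for the address of a remote Dask scheduler is all that changes when the workers live on real FPGA machines; the client-side splitting and gathering logic stays the same.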
As an example, we use the GZIP accelerator in the Vitis Data Compression Library (gzip_compression) to scale up GZIP compression on an Alveo U50 cluster in the cloud. The following plot shows a speedup of about 1.8x when using 2 U50 FPGAs to compress files of various sizes, compared to 1 FPGA. In both cases, the Dask client and the workers were located on separate machines in the Nimbix cloud. The time shown is the time taken to compress the entire input file. It does not include the network I/O time taken to transmit data between the client and the worker(s).
This setup is also much faster than a pure software implementation. The list below compares the throughput (in MBps) of FPGA-based and software-based compression. The machine used is an 8-core Intel Xeon CPU (E5-2640 v3 @ 2.60GHz).
- gzip: 30.6 MBps. The built-in single-threaded Linux gzip tool at the lowest (fastest) compression level.
- pigz: 157.6 MBps. A multi-threaded implementation of gzip at the lowest (fastest) compression level (https://zlib.net/pigz/).
- 1 FPGA: 348.3 MBps => 2.2x faster than multi-threaded software
- 2 FPGAs: 627.8 MBps => 4x faster than multi-threaded software
Alternative 2: Scaling FINN machine learning kernels
Here, we scale up the inference stage of a binarized convolutional neural network built with the FINN framework. The CNV-W1A1 network is used to classify 32x32 RGB images from the CIFAR-10 dataset. The bitstream was generated with the help of the test cases at https://github.com/Xilinx/finn/blob/master/tests/end2end/test_end2end_bnn_pynq.py.
Once the accelerator kernel is built, we can use Dask to parallelize the workload and improve inference speed. In our setup, we split the test dataset into 8 parts and used eight kernel accelerators deployed on 8 FPGAs in the XACC Xilinx academic FPGA cluster to classify the images in these parts in parallel. As a result, we observed an 8x speedup in classification time compared to using just 1 FPGA, as shown in the figure below. The figure shows linear scalability with little overhead introduced by our toolchain. We estimate that we could scale up to 100s of FPGAs before hitting any limitations.
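The splitting step for inference can be sketched in the same Dask style. This is an illustrative sketch only: `classify_batch` is a hypothetical stand-in for the real driver call that would push a batch through the FINN CNV-W1A1 overlay on an FPGA, the input is dummy CIFAR-10-shaped data, and the local 8-worker cluster stands in for the 8 FPGAs on the XACC cluster.

```python
import numpy as np
from dask.distributed import Client, LocalCluster

def classify_batch(batch: np.ndarray) -> np.ndarray:
    # Hypothetical stand-in: a real worker would feed the batch to the
    # FINN CNV-W1A1 overlay on its FPGA and return predicted class indices.
    return batch.reshape(len(batch), -1).sum(axis=1) % 10

# Stand-in for the 8-FPGA cluster: eight in-process workers.
cluster = LocalCluster(n_workers=8, processes=False)
client = Client(cluster)

# Dummy CIFAR-10-shaped data: 80 RGB images of 32x32 pixels.
images = np.random.randint(0, 256, size=(80, 32, 32, 3), dtype=np.uint8)
parts = np.array_split(images, 8)                # one part per FPGA worker
futures = client.map(classify_batch, parts)      # classify the parts in parallel
labels = np.concatenate(client.gather(futures))  # predictions in input order

client.close()
cluster.close()
```

Because `client.gather` preserves the order of the submitted futures, concatenating the per-part results reconstructs the predictions for the whole test set in the original order.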
Alternative 3: Scaling on embedded Pynq-Z1 boards
In this alternative, we built a small cluster of two low-power embedded Pynq-Z1 FPGA boards. These boards belong to Xilinx’s Zynq-7000 family of FPGAs. FPGAs in this family consist of a programmable FPGA fabric along with an ARM-based processor. The Pynq-Z1 board, specifically, has a dual-core ARM A9 processor and runs a Linux OS. More importantly, it comes pre-installed with a Jupyter Notebook web server and the Pynq Python package, which makes it an excellent entry-level platform to experiment and rapidly prototype solutions.
The setup has two Pynq-Z1 FPGA devices connected via a Gigabit router to a laptop. These devices were loaded with a bootable Linux image available from the Pynq website, which includes the Pynq Python package (version v2.5.1). A Dask cluster was created using the Dask Python package with Python 3.6.
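A Dask cluster of this shape can be started with the command-line tools that ship with the Dask distributed package: a scheduler on the laptop and one worker process per board. The IP address below is a placeholder for your own scheduler's address; 8786 is Dask's default scheduler port.

```shell
# On the laptop: start the Dask scheduler (listens on port 8786 by default).
dask-scheduler

# On each Pynq-Z1 board: start a Dask worker pointing at the laptop
# (192.168.1.100 is a placeholder for the scheduler's IP address).
dask-worker tcp://192.168.1.100:8786
```

A Dask `Client` on the laptop then connects to `tcp://192.168.1.100:8786` and sees both boards as workers.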
As the accelerated kernel, we again use a classification task: a convolutional neural network trained on the CIFAR-10 dataset. Pretrained neural network models were obtained using the Xilinx Python package. This package provides bitstreams for different neural networks, including the 2-bit weight and activation neural network we used for this experiment. It also includes a software-only implementation of the network built using Xilinx’s C++ deep learning framework Tiny-CNN.
Classifying the entire dataset using 1 Dask worker took a total of 38 seconds end-to-end. This includes the time taken to read the input file, send it to the remote worker, execute the inference on the FPGA, and get the result back. Scaling the setup to 2 Dask workers reduced the total runtime to 22 seconds, a speedup of 1.7x (see the figure below).
This project shows that data scientists and engineers can easily use existing tools to benefit from the immense capabilities of FPGA acceleration, without having to learn the details of hardware design. High-performance, low-power acceleration can be accessed from the comfort of Python code for up to 100s of FPGAs. We will be using this project as a lab example in our Supercomputing for Big Data course next year. After discussing these results with a couple of international colleagues, we have already received requests to use our toolchain for student courses in Korea.

Our next steps are to identify industrially relevant big data analytics pipelines and to create standardized FPGA accelerator libraries for commonly used Python functions. This will make using the whole scalable FPGA-accelerated toolchain as simple as including a library in your Python code. A video presentation of this project can be found here.