With the increasing complexity of machine learning algorithms, hardware acceleration is key to the future development of machine learning. Among all hardware acceleration technologies, the graphics processing unit (GPU) is the most common way to accelerate machine learning algorithms. GPUs excel at processing parallel operations on large amounts of data. However, GPUs generally lack power efficiency, which makes them unsuitable for low-power applications such as edge devices and embedded systems. This has led people to consider hardware technologies with lower power consumption, such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs).
Non-blocking networks can rearrange data between different execution units and memories. Firstly, a non-blocking network can maintain a high-speed connection between the execution units and the memories. Secondly, it can improve overall performance by eliminating data conflicts during the data re-positioning stage. Thirdly, this architecture allows a wide range of CNN models to run on the same hardware by compiling CNN models into different configurations of the non-blocking network.
Non-blocking network
The Benes network is a rearrangeably non-blocking network, which means that blocked connections can be established by rerouting existing connections through alternative paths. The Benes network has N = 2^n inputs and outputs and comprises (2n - 1) stages of 2 × 2 crossbar switch elements. It is a permutation network because it can realize all N! possible input-output orderings. Fig. 1 shows a 16 × 16 Benes network. The rectangular box in each stage is a 2 × 2 crossbar switch, which has two configurations: "Bar" and "Cross". When the control signal S = 0, the switch takes the "Bar" configuration; when S = 1, it takes the "Cross" configuration.
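The hardware itself is HDL (see the repository linked in the Result section), but the switching behavior is easy to model in software. The following Python sketch is a behavioral model only; the control-bit consumption order (input stage, then upper and lower sub-networks, then output stage) is an assumption made for illustration.

```python
def crossbar(s, a, b):
    """2x2 crossbar switch: s = 0 -> "Bar" (pass through), s = 1 -> "Cross" (swap)."""
    return (a, b) if s == 0 else (b, a)

def benes(data, controls):
    """Route `data` (length N = 2^n) through an N x N Benes network.

    `controls` holds one bit per switch, consumed in the order:
    input stage, upper sub-network, lower sub-network, output stage.
    A network of size N has (2n - 1) * N/2 switches in total.
    """
    n = len(data)
    if n == 2:                                   # base case: a single switch
        return list(crossbar(controls.pop(0), data[0], data[1]))
    half = n // 2
    upper, lower = [], []
    for i in range(half):                        # input stage
        a, b = crossbar(controls.pop(0), data[2 * i], data[2 * i + 1])
        upper.append(a)                          # upper output -> upper sub-network
        lower.append(b)                          # lower output -> lower sub-network
    upper = benes(upper, controls)               # two (N/2 x N/2) sub-networks
    lower = benes(lower, controls)
    out = []
    for i in range(half):                        # output stage
        a, b = crossbar(controls.pop(0), upper[i], lower[i])
        out.extend([a, b])
    return out

# All switches in "Bar" configuration realize the identity permutation.
print(benes(list("ABCD"), [0] * 6))              # ['A', 'B', 'C', 'D']
# Crossing only the first input-stage switch swaps the routes of A and B.
print(benes(list("ABCD"), [1, 0, 0, 0, 0, 0]))   # ['B', 'A', 'C', 'D']
```

For N = 4 the model uses (2·2 - 1) · 2 = 6 control bits, matching the (2n - 1) stages of N/2 switches described above.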
Architecture design
Fig. 2 shows the data flow diagram of the architecture. Unlike conventional GPU designs, which use a complex memory hierarchy to control and share data between different multiprocessors, our architecture has only two separate memories and uses dual Benes networks to re-arrange the data between different execution units. This approach does not need a large memory bus for data exchange, which saves resources and reduces latency.
Data flow in convolutional layer
Step 1: Fetching weights into weight buffers
Before loading the images, we need to fetch the weights from data memory to weight buffers. This process repeats until the weight buffers are full.
Step 2: Fetching input data into Benes network
After fetching the weight data, we can start fetching the image input data from data memory to the Benes network. The correct control signal is generated to let the data pass through the multiplexers. Once the input data arrive at the Benes network, the Benes control signals are read out from the Benes control memory and sent to the Benes network. Since our implementation of the Benes network is fully pipelined, the Benes control signals change every clock cycle.
In most convolutional layers, image filtering is the first step. This process is accomplished by convolving a kernel with an image. The convolution operation can be described as

O(i,j) = Σ_{m=0}^{N-1} Σ_{n=0}^{N-1} I(i·S + m, j·S + n) × K(m,n) + b   (1)

where O(i,j) is the value of the output image at position (i,j), I(i,j) is the value of the input image at position (i,j), and K(m,n) is the value of the kernel at position (m,n). We define the following symbols, which we call the configuration of a layer, to describe it. The numbers of input and output images are Ni and No. The input image size is Ri × Ci and the output image size is Ro × Co. There are Ni × No convolutional filters, each of which connects one input image to one output image. The filter size is N × N and the stride of the convolution is S. The bias is represented by b.
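A minimal Python reference implementation of (1), useful for checking the hardware output, is given below; variable names follow the layer configuration above, and the image/kernel values in the usage line are made up for illustration.

```python
def conv2d(I, K, b=0.0, S=1):
    """Software reference for (1): O(i,j) = sum_m sum_n I(i*S+m, j*S+n) * K(m,n) + b.

    I: input image (Ri x Ci, list of lists), K: N x N kernel, b: bias, S: stride.
    Output size: Ro x Co with Ro = (Ri - N) // S + 1 and Co = (Ci - N) // S + 1.
    """
    N = len(K)
    Ri, Ci = len(I), len(I[0])
    Ro, Co = (Ri - N) // S + 1, (Ci - N) // S + 1
    O = [[b for _ in range(Co)] for _ in range(Ro)]
    for i in range(Ro):
        for j in range(Co):
            for m in range(N):          # accumulate over the N x N window
                for n in range(N):
                    O[i][j] += I[i * S + m][j * S + n] * K[m][n]
    return O

# 4x4 image (X0..X15 in raster order) with a 2x2 kernel, stride 1 -> 3x3 output.
I = [[x for x in range(r * 4, r * 4 + 4)] for r in range(4)]
K = [[1, 0], [0, 1]]
print(conv2d(I, K))
```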
Fig. 3 shows an example of convolving a 4×4 image with a 2×2 kernel. All the image data (X0-X15) are fetched from data memory to the Benes network inputs in default sequential order. At the same time, the Benes control signals are fetched from Benes control memory 1. With the proper control signals, the outputs of the Benes network are as shown in Fig. 4.
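One reading of this example, consistent with the desired outputs Y0, Y2, Y6, Y8 in Step 3, is that a single pass routes the four non-overlapping 2×2 windows: since a Benes network realizes permutations (each input is used exactly once), disjoint windows fit naturally in one configuration. The sketch below lists the window-major order under that assumption; the exact ordering shown in Fig. 4 may differ.

```python
def window_order(R, C, N):
    """Indices of a raster-scanned R x C image re-ordered so that each
    non-overlapping N x N window becomes one contiguous group. Disjoint
    windows use every pixel exactly once, so this is a true permutation
    that a Benes network can realize in a single pass."""
    order = []
    for i in range(0, R - N + 1, N):      # step by N: non-overlapping windows
        for j in range(0, C - N + 1, N):
            for m in range(N):
                for n in range(N):
                    order.append((i + m) * C + (j + n))
    return order

# 4x4 image, 2x2 kernel: the four groups feed four parallel multiply-add units.
print(window_order(4, 4, 2))
# [0, 1, 4, 5, 2, 3, 6, 7, 8, 9, 12, 13, 10, 11, 14, 15]
```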
Step 3: Multiply-and-add operation
Just before the outputs of the Benes network reach the parallel multipliers, all the weights are fetched from data memory to the weight buffer. The weight buffer loads the weights in a repeating sequential order, as shown in Fig. 5. When the outputs of the Benes network and the weights arrive at the parallel multipliers, the multiplications are performed element-wise. The products are then passed to the parallel-out adder tree, which pulls out the outputs of every layer in the tree. For an n^2-to-1 adder tree, there are log2(n^2) layers and (n^2 - 1) partial sums in total. To make the total number of outputs match n^2, we pad a dummy ground signal as the last output. Fig. 5 shows the signals of the multiply-and-add operation for example 1. The desired outputs (Y0, Y2, Y6, Y8) appear at (O1, O5, O9, O13) respectively.
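As a behavioral sketch, the following Python function models the parallel-out adder tree: it collects every layer's partial sums and pads the dummy ground signal so the output width matches the input width. The exact output indexing in the hardware (e.g., Y0 appearing at O1) follows the wiring of Fig. 5 and is not reproduced here; this sketch simply emits layers in order.

```python
def parallel_out_adder_tree(inputs):
    """Sum-reduce `inputs` (length n^2, a power of two) pairwise, collecting
    the outputs of every tree layer. A full reduction produces n^2 - 1
    partial sums; one dummy 0 ("ground") is appended so the output width
    matches the n^2 inputs, as described above."""
    outputs = []
    layer = list(inputs)
    while len(layer) > 1:
        layer = [layer[i] + layer[i + 1] for i in range(0, len(layer), 2)]
        outputs.extend(layer)              # parallel-out: expose every layer
    outputs.append(0)                      # dummy ground signal pads to n^2
    return outputs

# 4 inputs (n = 2): layer 1 gives two pairwise sums, layer 2 the full sum,
# and the pad brings the total to n^2 = 4 outputs.
print(parallel_out_adder_tree([1, 2, 3, 4]))   # [3, 7, 10, 0]
```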
Step 4: Bias adding and activation layer
In these two stages, the data pass through the bias adders and the activation block. The bias adder stage contains 1024 floating-point adders for adding the bias to the results.
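A one-line software model of this step is shown below. The text does not name the activation function, so ReLU is assumed here purely for illustration; the hardware block may implement a different function.

```python
def bias_activation(sums, biases):
    """Step 4 in software: add the per-output bias, then apply the activation.
    ReLU is an assumption; swap in the activation the model actually uses."""
    return [max(0.0, s + b) for s, b in zip(sums, biases)]

print(bias_activation([1.5, -2.0, 0.25], [0.5, 0.5, -1.0]))  # [2.0, 0.0, 0.0]
```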
Step 5: Data re-positioning for maxpooling
To support various maxpooling sizes, we built a parallel-out comparator tree. For different filter sizes, we can get the correct maxima by simply choosing the output at different clock cycles. For example, if the filter size is 4×4, the valid outputs come out of the 4th layer. Assuming a latency of 3 cycles per tree layer, the valid outputs are available after 12 cycles. We also created a bypass path for the layers that do not need maxpooling.
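The layer/latency arithmetic above generalizes as in the following sketch. The 3-cycle per-layer latency is the assumption stated in the text, not a measured figure.

```python
import math

def pool_output_timing(filter_h, filter_w, layer_latency=3):
    """For the parallel-out comparator tree: a filter of size h x w reduces
    h*w values, so the valid maxima appear at tree layer log2(h*w).
    `layer_latency` is the assumed cycles per comparator layer."""
    elems = filter_h * filter_w               # values reduced per window
    layer = int(math.log2(elems))             # tree layer with the valid max
    return layer, layer * layer_latency       # (layer index, total cycles)

print(pool_output_timing(4, 4))   # (4, 12): 4th layer, valid after 12 cycles
print(pool_output_timing(2, 2))   # (2, 6)
```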
Step 6: Store
After going through the parallel-out comparator tree, the data are stored back to data memory, ready for the next layer of processing.
Result
Since the implementation is too large, the Kria KV260 Vision AI Starter Kit does not have enough resources and I failed to complete place and route. However, I have still attached my code at https://github.com/soul00777/Benes_net.