The goal of this project is to help you understand the step-by-step process of getting a complex program to run on an AMD-Xilinx FPGA (Field Programmable Gate Array). We will explore in detail the optimizations and steps needed to make our logic run as fast and efficiently as possible.
We will be developing the Poseidon hashing algorithm used in the Filecoin blockchain/cryptocurrency as the complex logic we want to run on our FPGA. The algorithm is complex enough to expose many of the steps in getting the most from an FPGA without being too much to handle.
- You'll gain an understanding of what AMD-Xilinx Vitis HLS (High Level Synthesis) is and how to use it.
- Examine and use the #pragmas needed to specialize the code to the hardware.
- Improve the throughput of the logic to get the most performance.
- Reduce resource consumption to make sure your logic fits in the available resources.
- Handle issues, errors, and gotchas when dealing with complex logic.
AMD-Xilinx FPGA
This project is based on the AMD-Xilinx Varium C1100 Blockchain Accelerator Card. You can purchase this card from AMD-Xilinx or one of their partners. Other physical or cloud-based AMD-Xilinx FPGA cards will work if they are similar in specifications and you adjust throughout the project to the card resources you have available.
Vitis Software Installed
Computer:
- Ubuntu operating system that matches the version required by Vitis software installed.
- 64 GB of RAM (32 GB may work) and a large swap file. AMD-Xilinx recommends 80+ GB. I did see memory usage this high and higher, however most of the time that happened I was heading in the wrong direction with optimization. When the optimization was correct I utilized 32 GB of RAM or less. You will experience longer compile times if utilizing large amounts of swap.
- You need the hardware and software already installed. There is documentation for the AMD-Xilinx Varium C1100 Compute Adaptor that I used to get my system running. See your hardware installation guide, other projects from Hackster.io, or search online for further setup help.
AMD-Xilinx Vitis HLS (High Level Synthesis) is a C / C++ / OpenCL programming subset that is compiled into RTL (Register Transfer Level) for implementation in the FPGA. The ability to work in a high level language to define hardware circuits allows for a simplified and rapid development cycle vs working directly in RTL.
The overall steps to implementing a design in HLS:
- Initial directives - creating the interface into the code.
- Dataflow - Streaming data between Load-Compute-Store functions.
- Parallelism - Pipeline to maximize throughput.
- Improve Area - fit the logic into the resources available.
- Balance - fine tuning the final result.
To achieve the best results we need to think about how data moves through the circuits of the hardware from clock cycle to clock cycle.
- What pathway does a piece of data flow through?
- How long does it take to move from one group of circuits to the next?
- Will modifications of data interfere with its own state or the state of other data flowing through the system?
- How much hardware resource is needed to achieve the current design?
Don't worry, we won't actually be looking at circuits or pathways, but it is helpful to realize we are physically creating circuits on a chip and not just running software. In addition, the Vitis software will help us identify when a design has violated or exceeded any of the above considerations.
Thinking about how data flows through an FPGA will take practice, and you will potentially need to rethink how you write your logic. We will use many #pragmas within our project to define how the compiler should handle sections of logic and how data flows through the hardware.
Example pragmas:
#pragma HLS dataflow
#pragma HLS array_partition variable=constantsFr complete
#pragma HLS inline
#pragma HLS pipeline II=3
HLS C / C++ then has a few requirements in order for the compiler to successfully create the hardware definition:
- The entire functionality of the design must be contained in the HLS program. One top level function and any number of sub functions or libraries that satisfy the requirements of the compiler.
- No system calls to the operating system. An exception is made for printf and cout, which only display data and have no impact on execution of the algorithm.
- The C / C++ constructs must be of a fixed or bounded size and must be unambiguous. Since the compiler synthesizes actual connections between components on the chip, it must understand these up front. No use of dynamic memory; all data must be fully self-contained and the definition of resources specified.
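As a minimal sketch of that last requirement (a hypothetical function, not from the project code):

// Not synthesizable: the size is only known at runtime and the memory is dynamic.
//   int *buf = (int*)malloc(n * sizeof(int));
// Synthesizable: a bounded, compile-time-known size the compiler can map to BRAM.
#define MAX_N 1024
void scale_by_two(const int in[MAX_N], int out[MAX_N], int n) {
scale_loop: for (int i = 0; i < MAX_N; i++) { // fixed bound; runtime guard inside
        if (i < n) out[i] = in[i] * 2;
    }
}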
I found the following list of documents and web pages useful in learning HLS and FPGA development. If you get stuck along the way or want more information I highly recommend reading the following:
Vitis High-Level Synthesis User Guide (UG1399)
Vitis Unified Software Platform Documentation: Application Acceleration Development (UG1393)
Xilinx / Vitis-Tutorials - Github
Vitis HLS Migration Guide (UG1391)
Vivado Design Suite User Guide: High-Level Synthesis (UG902)
Vitis Unified Software Platform - Hello World
Screenshots of Vitis software used with permission from AMD-Xilinx. All Vitis software is copyrighted by Xilinx, Inc.
Start up your Vitis IDE.
Let's create the basis for our project.
File -> New -> Application Project
This will create a multi-layer project: a host application that runs on the processor and will send and receive data from the kernel, a kernel application that runs on the FPGA and does the accelerated processing, and the linking between the two. There is nothing to set up in this window.
Click "Next >".
The Vitis IDE needs to know the platform we are targeting. For the Varium C1100 the board is referenced as u55n with a full name of: "xilinx_u55n_gen3x4_xdma_1_202110_1"
This may be installed under your /opt/xilinx/platforms directory if you used the default installation instructions. If you have a different accelerator card, find its platform and add it on the platform page.
Click "Next >".
Give the project a name "Poseidon_Arity2".
Click "Next >".
From the templates let's select the "Simple Vector Addition". With this template Vitis will create a host application and small kernel for us.
Click "Finish".
We should end up with a project with three modules representing the kernel, hardware link, and host.
Normally we wouldn't jump so quickly to running on hardware. Running on hardware is the slowest of the stages and is usually the last step in the process. This example is designed to work out of the box, so let's give it a try.
If you can build and run this code on the FPGA then your system is properly setup. If you receive errors from the build process go back and verify your installation and system setup.
1. Click on the top level "Poseidon_Arity2_system".
2. Click the drop down arrow of the build icon.
3. Select "3 Hardware".
You should see the Assistant window and console output showing the build process.
This will take a while to finish compiling.
Now click the run icon.
When the code runs it will add together numbers from two different datasets and return the results. A "TEST PASSED" in the console window will indicate it worked as intended.
What Actually Happened
1. The Vitis IDE compiled the kernel and host applications. Our Vitis IDE was on the same machine, but that is not a requirement.
2. Hitting the Run icon in the IDE then caused the host OpenCL application to run as a standalone program.
3. The host OpenCL looked for and found our FPGA.
4. The host OpenCL loaded the kernel xclbin and created an OpenCL program identifying the entry function.
5. The data and task (kernel) were enqueued and sent to the FPGA.
6. The kernel runs on the FPGA with the input data provided.
7. Results from the kernel are sent back to the host.
8. The host waits until the kernel has finished and then completes its own logic.
In your Vitis IDE take a brief look at the host code vadd.cpp and kernel code krnl_vadd.cpp to try to identify where the above steps are happening. We will get into more detail when we write our own code.
Filecoin - Blockchain For Distributed Storage
Filecoin.io is a cooperative digital storage and retrieval blockchain that incentivizes providers to offer unused storage through a cryptocurrency called FIL. Filecoin builds on top of IPFS (InterPlanetary File System), which is a protocol to store and access files across a distributed file system.
This project looks at the core of the protocol, the Poseidon hashing algorithm used in Merkle Tree formation. The performance of the hashing algorithm determines how quickly a new file can be sealed and inserted into the blockchain. A better optimized implementation allows for accepting more deals in exchange for FIL coins, and a more efficient implementation reduces hardware and electricity cost.
Filecoin wraps the Poseidon hashing algorithm with additional supporting code in project Neptune (filecoin-project/neptune). Neptune (at this point) is primarily oriented around sending OpenCL/CUDA code to GPUs for accelerated processing. We will be looking at the Poseidon OpenCL code and accelerating it with our FPGA.
Poseidon Hashing Algorithm
The detailed mathematical definition of Poseidon hashing is found on the Filecoin Spec website. We will look at the high level view so we can focus on optimization details as they relate to FPGAs.
A block of data is split into 32 Byte chunks and progresses through a series of "rounds" that form a cryptographic hash of the data.
Each round consists of a number of Quintic S-Box operations (raising an element to the fifth power, x^5) followed by multiplication by constants found in a matrix (see the toy sketch below). All of this is done over a field using Montgomery Form to improve performance (by avoiding division operations).
The output then becomes a node in the final Merkle Tree.
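Here is a toy sketch of the Quintic S-Box cost, using a small prime p = 97 as a stand-in for the real 255-bit field: x^5 takes just three modular multiplications (square, square, multiply).

#include <stdint.h>
#include <stdio.h>
// Toy field multiply: the real code uses 256-bit Montgomery multiplication.
static uint64_t mulmod(uint64_t a, uint64_t b) { return (a * b) % 97; }
int main(void) {
    uint64_t x = 5;
    uint64_t x2 = mulmod(x, x);   // x^2
    uint64_t x4 = mulmod(x2, x2); // x^4
    uint64_t x5 = mulmod(x4, x);  // x^5
    printf("%llu\n", (unsigned long long)x5); // 5^5 = 3125, and 3125 mod 97 = 21
    return 0;
}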
Filecoin-project/Neptune OpenCL Code
Take a look at the code generated by Neptune. See Project Code section (bottom of this page) "Baseline_OpenCL_from_Neptune.cl"
This code is focused on optimizations for GPGPU usage. It is designed for running many small kernels across many small cores of a GPU.
I simplified this code (removing vendor-specific code and reducing indirection) to give a clearer view of what we need to replicate and optimize for running on our FPGA. See Project Code section (bottom of this page) "Baseline_OpenCL_from_Neptune_simplified.cl"
// Returns a * b + c + d, puts the carry in d
ulong mac_with_carry_64(ulong a, ulong b, ulong c, ulong *d) {
ulong lo = a * b + c;
ulong hi = mad_hi(a, b, (ulong)(lo < c));
a = lo;
lo += *d;
hi += (lo < a);
*d = hi;
return lo;
}
// Returns a + b, puts the carry in b
ulong add_with_carry_64(ulong a, ulong *b) {
ulong lo = a + *b;
*b = lo < a;
return lo;
}
// Returns a * b + c + d, puts the carry in d
uint mac_with_carry_32(uint a, uint b, uint c, uint *d) {
ulong res = (ulong)a * b + c + *d;
*d = res >> 32;
return res;
}
// Returns a + b, puts the carry in b
uint add_with_carry_32(uint a, uint *b) {
uint lo = a + *b;
*b = lo < a;
return lo;
}
#define Fr_limb ulong
#define Fr_LIMBS 4
#define Fr_LIMB_BITS 64
#define Fr_INV 18446744069414584319
typedef struct { Fr_limb val[Fr_LIMBS]; } Fr;
__constant Fr Fr_ONE = { { 8589934590, 6378425256633387010, 11064306276430008309, 1739710354780652911 } };
__constant Fr Fr_P = { { 18446744069414584321, 6034159408538082302, 3691218898639771653, 8353516859464449352 } };
__constant Fr Fr_R2 = { { 14526898881837571181, 3129137299524312099, 419701826671360399, 524908885293268753 } };
__constant Fr Fr_ZERO = { { 0, 0, 0, 0 } };
// Greater than or equal
bool Fr_gte(Fr a, Fr b) {
for(char i = Fr_LIMBS - 1; i >= 0; i--){
if(a.val[i] > b.val[i])
return true;
if(a.val[i] < b.val[i])
return false;
}
return true;
}
// Equals
bool Fr_eq(Fr a, Fr b) {
for(uchar i = 0; i < Fr_LIMBS; i++)
if(a.val[i] != b.val[i])
return false;
return true;
}
// Normal addition
Fr Fr_add_(Fr a, Fr b) {
bool carry = 0;
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb old = a.val[i];
a.val[i] += b.val[i] + carry;
carry = carry ? old >= a.val[i] : old > a.val[i];
}
return a;
}
Fr Fr_sub_(Fr a, Fr b) {
bool borrow = 0;
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb old = a.val[i];
a.val[i] -= b.val[i] + borrow;
borrow = borrow ? old <= a.val[i] : old < a.val[i];
}
return a;
}
// Modular subtraction
Fr Fr_sub(Fr a, Fr b) {
Fr res = Fr_sub_(a, b);
if(!Fr_gte(a, b)) res = Fr_add_(res, Fr_P);
return res;
}
// Modular addition
Fr Fr_add(Fr a, Fr b) {
Fr res = Fr_add_(a, b);
if(Fr_gte(res, Fr_P)) res = Fr_sub_(res, Fr_P);
return res;
}
// Modular multiplication
Fr Fr_mul(Fr a, Fr b) {
/* CIOS Montgomery multiplication, inspired from Tolga Acar's thesis:
* https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
* Learn more:
* https://en.wikipedia.org/wiki/Montgomery_modular_multiplication
* https://alicebob.cryptoland.net/understanding-the-montgomery-reduction-algorithm/
*/
Fr_limb t[Fr_LIMBS + 2] = {0};
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb carry = 0;
for(uchar j = 0; j < Fr_LIMBS; j++)
t[j] = mac_with_carry_64(a.val[j], b.val[i], t[j], &carry);
t[Fr_LIMBS] = add_with_carry_64(t[Fr_LIMBS], &carry);
t[Fr_LIMBS + 1] = carry;
carry = 0;
Fr_limb m = Fr_INV * t[0];
mac_with_carry_64(m, Fr_P.val[0], t[0], &carry);
for(uchar j = 1; j < Fr_LIMBS; j++)
t[j - 1] = mac_with_carry_64(m, Fr_P.val[j], t[j], &carry);
t[Fr_LIMBS - 1] = add_with_carry_64(t[Fr_LIMBS], &carry);
t[Fr_LIMBS] = t[Fr_LIMBS + 1] + carry;
}
Fr result;
for(uchar i = 0; i < Fr_LIMBS; i++) result.val[i] = t[i];
if(Fr_gte(result, Fr_P)) result = Fr_sub_(result, Fr_P);
return result;
}
// Squaring is a special case of multiplication which can be done ~1.5x faster.
// https://stackoverflow.com/a/16388571/1348497
Fr Fr_sqr(Fr a) {
return Fr_mul(a, a);
}
Fr Fr_mont(Fr a) {
return Fr_mul(a, Fr_R2);
}
Fr Fr_unmont(Fr a) {
Fr one = Fr_ZERO;
one.val[0] = 1;
return Fr_mul(a, one);
}
Fr quintic_s_box(Fr l, Fr pre_add, Fr post_add) {
Fr tmp = Fr_add(l, pre_add);
tmp = Fr_sqr(l);
tmp = Fr_sqr(tmp);
tmp = Fr_mul(tmp, l);
tmp = Fr_add(tmp, post_add);
return tmp;
}
Fr scalar_product(__constant Fr* a, Fr* b, int size) {
Fr res = Fr_ZERO;
for (int i = 0; i < size; ++i) {
Fr tmp = Fr_mul(a[i], b[i]);
res = Fr_add(res, tmp);
}
return res;
}
typedef struct state_2_standard {
Fr elements[3];
int current_round;
int rk_offset;
} state_2_standard;
state_2_standard apply_matrix_2_standard (__constant Fr matrix[3][3], state_2_standard s) {
Fr tmp[3];
for (int i = 0; i < 3; ++i) {
tmp[i] = s.elements[i];
s.elements[i] = Fr_ZERO;
}
for (int j = 0; j < 3; ++j) {
for (int i = 0; i < 3; ++i) {
s.elements[j] = Fr_add(s.elements[j], Fr_mul(matrix[i][j], tmp[i]));
}
}
return s;
}
state_2_standard apply_sparse_matrix_2_standard (__constant Fr sm[5], state_2_standard s) {
Fr first_elt = s.elements[0];
s.elements[0] = scalar_product(sm + 0, s.elements, 3);
for (int i = 1; i < 3; ++i) {
Fr val = Fr_mul((sm + 3)[i-1], first_elt);
s.elements[i] = Fr_add(s.elements[i], val);
}
return s;
}
state_2_standard apply_round_matrix_2_standard (__constant Fr constants[373], state_2_standard s) {
if (s.current_round == 3) {
s = apply_matrix_2_standard((__constant Fr (*)[3])(constants + 89), s);
} else if ((s.current_round > 3) && (s.current_round < 4 + 55)) {
int index = s.current_round - 3 - 1;
s = apply_sparse_matrix_2_standard(constants + 98 + (index * 5), s);
} else {
s = apply_matrix_2_standard((__constant Fr (*)[3])(constants + 80), s);
}
return s;
}
state_2_standard add_full_round_keys_2_standard (__constant Fr constants[373], state_2_standard s) {
for (int i = 0; i < 3; ++i) {
s.elements[i] = Fr_add(s.elements[i], (constants + 1)[s.rk_offset + i]);
}
s.rk_offset += 3;
return s;
}
state_2_standard full_round_2_standard (__constant Fr constants[373], state_2_standard s) {
for (int i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, (constants + 1)[s.rk_offset + i]);
}
s.rk_offset += 3;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
state_2_standard last_full_round_2_standard (__constant Fr constants[373], state_2_standard s) {
for (int i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, Fr_ZERO);
}
s = apply_round_matrix_2_standard(constants, s);
return s;
}
state_2_standard partial_round_2_standard (__constant Fr constants[373], state_2_standard s) {
s.elements[0] = quintic_s_box(s.elements[0], Fr_ZERO, (constants + 1)[s.rk_offset]);
s.rk_offset += 1;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
state_2_standard hash_2_standard (__constant Fr constants[373], state_2_standard s) {
s = add_full_round_keys_2_standard(constants, s);
for (int i = 0; i < 4; ++i) {
s = full_round_2_standard(constants, s);
}
for (int i = 0; i < 55; ++ i) {
s = partial_round_2_standard(constants, s);
}
for (int i = 0; i < (4 - 1); ++ i) {
s = full_round_2_standard(constants, s);
}
s = last_full_round_2_standard(constants, s);
return s;
}
__kernel void hash_preimages_2_standard(__constant Fr constants[373],
__global Fr *preimages,
__global Fr *digests,
int batch_size
) {
int global_id = get_global_id(0);
if (global_id < batch_size) {
int offset = global_id * 2;
state_2_standard s;
s.elements[0] = constants[0];
for (int i = 0; i < 2; ++i) {
s.elements[i+1] = preimages[offset + i];
}
s.current_round = 0;
s.rk_offset = 0;
s = hash_2_standard(constants, s);
digests[global_id] = s.elements[1];
}
}
There are two major sections:
- The first half (lines 1 - 50) is dedicated to Montgomery Modular Multiplication, which is a way of performing fast modular multiplication. It uses a special representation of a number called Montgomery Form in which it is more efficient to calculate a*b mod N. The efficiency is gained from avoiding expensive division operations.
- The second half (lines 51 - 93) is the Poseidon hashing implementation. The entry point is the last function, hash_preimages_2_standard(...), which calls hash_2_standard(...) to do the looping over the rounds of calculation.
Let's go step by step and build small sections at a time. We will be using C++ to create our kernel so we can take full advantage of HLS #pragma and data types for optimizing the hardware.
Create a new file under "Poseidon_Arity2_kernels/src" and call it "krnl_poseidon_arity2.cpp".
And insert the following starting code:
See Project Code section (bottom of this page) "krnl_poseidon_arity2_1.cpp"
#include <stdint.h>
#include <hls_stream.h>
#include <cstdio>
#include "ap_int.h"
#include <iostream>
#include <math.h>
using namespace std;
typedef struct Fr {
unsigned long long val[4];
} Fr;
typedef struct state_2_standard {
Fr elements[3];
int current_round;
int rk_offset;
} state_2_standard;
static void load_input(Fr* preimages, hls::stream<state_2_standard> &preimage_stream, const Fr constantZero, int size) {
#pragma HLS inline off
load_input: for (int i = 0; i < size; i++) {
state_2_standard s;
s.elements[0] = constantZero;
for (int j=0; j < 2; j++) {
Fr preimage = preimages[i * 2 + j];
s.elements[j+1] = Fr { { preimage.val[0], preimage.val[1], preimage.val[2], preimage.val[3] } };
}
s.current_round = 0;
s.rk_offset = 0;
preimage_stream << s;
}
}
static void hash_2_standard(const Fr constants[], hls::stream<state_2_standard> &in, hls::stream<Fr> &out, int size) {
#pragma HLS inline off
compute: for (int i = 0; i < size; i++) {
state_2_standard s = in.read();
int testAdd = size * 2;
cout << "hello from the kernel " << testAdd << endl;
out.write(Fr { { 0, 0, 0, 0 } });
}
}
static void store_result(hls::stream<Fr> &out_stream, Fr* out, int size) {
#pragma HLS inline off
store_result: for (int i = 0; i < size; i++) {
out[i] = out_stream.read();
}
}
extern "C" {
void poseidon_hash(
const Fr constants[373],
Fr *preimages,
Fr *digests,
int batch_size
) {
#pragma HLS interface m_axi port = preimages bundle = gmem0
#pragma HLS interface m_axi port = digests bundle = gmem1
#pragma HLS dataflow
const unsigned int correctSize = batch_size/2;
Fr constantsZero;
static hls::stream<state_2_standard> preimage_stream("preimage_stream");
static hls::stream<Fr> digests_stream("digests_stream");
#pragma HLS stream variable=preimage_stream
#pragma HLS stream variable=digests_stream
constantsZero = constants[0];
load_input(preimages, preimage_stream, constantsZero, correctSize);
hash_2_standard(constants, preimage_stream, digests_stream, correctSize);
store_result(digests_stream, digests, correctSize);
}
}
- Create an entry point function poseidon_hash(...) with the inputs we need to match the API from Neptune. This doesn't exactly match the OpenCL entry point because we want to do additional dataflow optimizations before getting to the main hash function. This will be our top level function that Vitis will link to from the host application.
- Vitis HLS requires the kernel top level function declaration to be wrapped with extern "C" in either a header or in the main file.
- Create struct Fr that will be the data container we use to send and receive data from the host. This defines the API contract that Neptune uses, so we need to maintain this struct.
- Create struct state_2_standard as our internal data container for use in the algorithm.
- Setup dataflow functions (more details below).
- Add labels like store_result: to loops. When we analyze through Vitis HLS the label will be shown in the UI, which is much easier to read than the compiler generated loop name.
Let's create the host program so we have code that can run our kernel. The host utilizes OpenCL to manage the entire life cycle between system and accelerator. OpenCL (in any programming language) will be similar in structure.
Look at vadd.cpp to see the structure.
Let's create our own file now.
- Copy the "vadd.cpp" and name it "host_poseidon.cpp".
- Rename "vadd.cpp" to "vadd.txt" so it doesn't cause issues with two
main()
functions in the same project.
Let's modify the "host_poseidon.cpp" to fit our needs (details will be explained later). SeeProject Code section (bottom of this page) "host_poseidon_1.cpp".
- Add a new include to allow for bit-precise numbers: #include "ap_int.h".
- Reduce our data size to 2 for development (arity 2 means we have two inputs, so this is the minimum test): static const int DATA_SIZE = 2;
- When creating the OpenCL kernel definition we need to give it the entry function we defined in the kernel. In our case it is "poseidon_hash": OCL_CHECK(err, krnl_poseidon_hash = cl::Kernel(program, "poseidon_hash", &err));
- Change the input and output variables to those needed by the kernel entry function: buffer_constants, buffer_preimages, buffer_digests, Fr *ptr_constants, Fr *ptr_preimages, Fr *ptr_digests.
- Setup a little data:
for (unsigned long long i = 0; i < 373; i++) {
    ptr_constants[i] = Fr { { i, 0, 0, 0 } };
}
for (unsigned long long i = 0; i < DATA_SIZE; i++) {
    ptr_preimages[i] = Fr { { i, 0, 0, 0 } };
}
- Finally let's add a few outputs to show what's happening.
#define OCL_CHECK(error, call) \
call; \
if (error != CL_SUCCESS) { \
printf("%s:%d Error calling " #call ", error code is: %d\n", __FILE__, __LINE__, error); \
exit(EXIT_FAILURE); \
}
#include <stdlib.h>
#include <fstream>
#include <iostream>
#include "vadd.h"
#include "ap_int.h"
using namespace std;
static const int DATA_SIZE = 2;
static const std::string error_message =
"Error: Result mismatch:\n"
"i = %d CPU result = %d Device result = %d\n";
typedef struct Fr {
unsigned long long val[4];
} Fr;
int main(int argc, char* argv[]) {
//TARGET_DEVICE macro needs to be passed from gcc command line
if(argc != 2) {
std::cout << "Usage: " << argv[0] <<" <xclbin>" << std::endl;
return EXIT_FAILURE;
}
std::string xclbinFilename = argv[1];
// Compute the size of array in bytes
size_t size_in_bytes = DATA_SIZE * sizeof(Fr);
size_t constants_size_in_bytes = 373 * sizeof(Fr);
// Creates a vector of DATA_SIZE elements with an initial value of 10 and 32
// using customized allocator for getting buffer alignment to 4k boundary
std::vector<cl::Device> devices;
cl::Device device;
cl_int err;
cl::Context context;
cl::CommandQueue q;
cl::Kernel krnl_poseidon_hash;
cl::Program program;
std::vector<cl::Platform> platforms;
bool found_device = false;
//traversing all Platforms To find Xilinx Platform and targeted
//Device in Xilinx Platform
cl::Platform::get(&platforms);
for(size_t i = 0; (i < platforms.size() ) & (found_device == false) ;i++){
cl::Platform platform = platforms[i];
std::string platformName = platform.getInfo<CL_PLATFORM_NAME>();
if ( platformName == "Xilinx"){
devices.clear();
platform.getDevices(CL_DEVICE_TYPE_ACCELERATOR, &devices);
if (devices.size()){
device = devices[0];
found_device = true;
break;
}
}
}
if (found_device == false){
std::cout << "Error: Unable to find Target Device "
<< device.getInfo<CL_DEVICE_NAME>() << std::endl;
return EXIT_FAILURE;
}
// Creating Context and Command Queue for selected device
OCL_CHECK(err, context = cl::Context(device, NULL, NULL, NULL, &err));
OCL_CHECK(err, q = cl::CommandQueue(context, device, CL_QUEUE_PROFILING_ENABLE, &err));
std::cout << "INFO: Reading " << xclbinFilename << std::endl;
FILE* fp;
if ((fp = fopen(xclbinFilename.c_str(), "r")) == nullptr) {
printf("ERROR: %s xclbin not available please build\n", xclbinFilename.c_str());
exit(EXIT_FAILURE);
}
// Load xclbin
std::cout << "Loading: '" << xclbinFilename << "'\n";
std::ifstream bin_file(xclbinFilename, std::ifstream::binary);
bin_file.seekg (0, bin_file.end);
unsigned nb = bin_file.tellg();
bin_file.seekg (0, bin_file.beg);
char *buf = new char [nb];
bin_file.read(buf, nb);
// Creating Program from Binary File
cl::Program::Binaries bins;
bins.push_back({buf,nb});
devices.resize(1);
OCL_CHECK(err, program = cl::Program(context, devices, bins, NULL, &err));
// This call will get the kernel object from program. A kernel is an
// OpenCL function that is executed on the FPGA.
OCL_CHECK(err, krnl_poseidon_hash = cl::Kernel(program,"poseidon_hash", &err));
// These commands will allocate memory on the Device. The cl::Buffer objects can
// be used to reference the memory locations on the device.
OCL_CHECK(err, cl::Buffer buffer_constants(context, CL_MEM_READ_ONLY, constants_size_in_bytes, NULL, &err));
OCL_CHECK(err, cl::Buffer buffer_preimages(context, CL_MEM_READ_ONLY, size_in_bytes, NULL, &err));
OCL_CHECK(err, cl::Buffer buffer_digests(context, CL_MEM_WRITE_ONLY, size_in_bytes, NULL, &err));
//set the kernel Arguments
int narg=0;
OCL_CHECK(err, err = krnl_poseidon_hash.setArg(narg++,buffer_constants));
OCL_CHECK(err, err = krnl_poseidon_hash.setArg(narg++,buffer_preimages));
OCL_CHECK(err, err = krnl_poseidon_hash.setArg(narg++,buffer_digests));
OCL_CHECK(err, err = krnl_poseidon_hash.setArg(narg++,DATA_SIZE));
//We then need to map our OpenCL buffers to get the pointers
Fr *ptr_constants;
Fr *ptr_preimages;
Fr *ptr_digests;
OCL_CHECK(err, ptr_constants = (Fr*)q.enqueueMapBuffer (buffer_constants , CL_TRUE , CL_MAP_WRITE , 0, constants_size_in_bytes, NULL, NULL, &err));
OCL_CHECK(err, ptr_preimages = (Fr*)q.enqueueMapBuffer (buffer_preimages , CL_TRUE , CL_MAP_WRITE , 0, size_in_bytes, NULL, NULL, &err));
OCL_CHECK(err, ptr_digests = (Fr*)q.enqueueMapBuffer (buffer_digests , CL_TRUE , CL_MAP_READ , 0, size_in_bytes, NULL, NULL, &err));
cout << "setting up data" << endl;
for (unsigned long long i = 0; i < 373; i++) {
ptr_constants[i] = Fr { { i, 0, 0, 0 } };
}
for (unsigned long long i = 0; i < DATA_SIZE; i++) {
ptr_preimages[i] = Fr { { i, 0, 0, 0 } };
}
// Data will be migrated to kernel space
OCL_CHECK(err, err = q.enqueueMigrateMemObjects({buffer_constants, buffer_preimages}, 0/* 0 means from host*/));
//Launch the Kernel
OCL_CHECK(err, err = q.enqueueTask(krnl_poseidon_hash));
// The result of the previous kernel execution will need to be retrieved in
// order to view the results. This call will transfer the data from FPGA to
// source_results vector
OCL_CHECK(err, q.enqueueMigrateMemObjects({buffer_digests},CL_MIGRATE_MEM_OBJECT_HOST));
cout << "waiting" << endl;
OCL_CHECK(err, q.finish());
cout << "finish" << endl;
//Verify the result
int match = 0;
for (int i=0; i< (DATA_SIZE >> 1); i++) {
cout << " output " << ptr_digests[i].val[0] << " " << ptr_digests[i].val[1] << " " << ptr_digests[i].val[2] << " " << ptr_digests[i].val[3] << endl;
}
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_constants, ptr_constants));
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_preimages, ptr_preimages));
OCL_CHECK(err, err = q.enqueueUnmapMemObject(buffer_digests, ptr_digests));
OCL_CHECK(err, err = q.finish());
cout << "TEST " << (match ? "FAILED" : "PASSED") << std::endl;
return (match ? EXIT_FAILURE : EXIT_SUCCESS);
}
Since we are creating our own kernel we need to tell Vitis about the new kernel entry function.
- Open the file "Poseidon_Arity2_kernels.prj".
- Remove the "krnl_vadd" hardware function.
- Click "Add Hardware Function" and select "poseidon_hash".
The primary way to identify whether the logic is correct is through software emulation. This is also the fastest compilation method, so catching issues in software emulation first is recommended.
Software emulation will run the code on the system CPU. Certain hardware-specific optimizations like dataflow will be analyzed during the run, but most other #pragmas are ignored.
Selecting the project will build all sub modules. If you only have a sub module selected then only that sub module will build.
- Select the full project "Poseidon_Arity2_system".
- Select the build icon drop down and select "2 Emulation-SW". Fix any errors reported in the console window. If you fixed errors repeat the build.
- Select the run icon. Output will be sent to the console.
Once our code is working correctly we move to hardware optimization. All of the #pragmas and optimizations now come into effect. Vitis will build an RTL design, and when we run the logic this RTL design (as opposed to the code) is the basis for the execution and output.
The build process logs show how optimizations are being applied, errors from #pragmas, and some resource information like how many multipliers or multiplexers will be created.
- From the build icon drop down select "1 Emulation-HW". Fix any errors reported in the console window. If you fixed errors repeat the build.
- Select the run icon. Output will be sent to the console.
More detailed hardware optimization analysis is done through the Vitis HLS application. You can launch the application from the Vitis IDE.
- Build using "Emulate-HW".
- Open the file "Poseidon_Arity2_kernels.prj".
- Select the "poseidon_hash" hardware function.
- Click the "Launch Vitis HLS..." button.
- Read and click "OK" on the "Launch Vitis HLS" window.
- The Vitis HLS application will open.
Look around the application.
- You can edit the kernel.
- Run C Synthesis (Emulation-HW).
- View multiple types of reports.
- See the estimated clock timing.
- Look at performance metrics like latency, interval (II - iteration interval), if sections were pipelined, and the number of resources used.
- See the Guidance tab output (log output from the compiler).
Back to our kernel to optimize.
#pragma HLS interface
This #pragma defines how we want the data ports to interface with the entry function in the kernel. This is possibly the closest we get to needing to understand something about RTL. Vitis will try to infer as much as possible, but we have the option of overriding those defaults.
#pragma HLS interface m_axi port = preimages bundle = gmem0
#pragma HLS interface m_axi port = digests bundle = gmem1
These #pragmas specify how the data from the host will be connected to the kernel. We will use an AXI4 interface with separate ports (gmem0 and gmem1) for preimages vs digests.
Wikipedia defines the AXI4 specification as:
a parallel high-performance, synchronous, high-frequency, multi-initiator, multi-target communication interface, mainly designed for on-chip communication
Technically one port can read and write at the same time, so both could have been placed on gmem0, but it is important to realize that multiple inputs/outputs should be in different bundles to provide higher throughput to the kernel.
The inputs constants and batch_size were not specified, and Vitis was able to infer how they should be connected (from Vitis HLS).
Exploiting Task Level Parallelism: Dataflow Optimization
The first optimization to apply is dataflow. Dataflow allows data to be passed between functions as a stream (as opposed to batches). With dataflow, the entirety of the data doesn't need to be finished before the next stage starts; each element is passed on to the next consumer as soon as the producer has completed it. The functions involved in dataflow use the producer-consumer paradigm, which sets up channels between functions to allow the hardware to start processing data as it becomes available.
Vitis recommends setting up Load-Compute-Store functions and applying dataflow between them. This is why we specified a different entry point into our kernel "krnl_poseidon_arity2.cpp", so that dataflow could be set up. Our poseidon_hash(...) function becomes dedicated to handling dataflow. The hls::stream<...> defines the channel between functions, and the three functions at the bottom represent Load-Compute-Store.
(Note: the image above has #pragma HLS stream; this is only needed if your channel is an array.)
The #pragma HLS dataflow can be defined in a function or loop.
We must also take into account certain limitations of dataflow:
- Restrictions on reading from function inputs or writing to function outputs in the middle of the dataflow region.
- Single-producer-consumer pattern must be maintained.
- Bypassing tasks and channel sizing can reduce performance.
- There are limitations on feedback between tasks.
- Conditional execution of tasks cannot be used.
- Loops with multiple exit conditions cannot be used.
Given the narrow scope of Load-Compute-Store and the fact that dataflow is not hierarchical (it doesn't apply to sub-functions), we have no issue staying within these limitations.
Further, the canonical form for specifying dataflow requires that we not inline the sub-functions. We can ensure this with #pragma HLS inline off inside each of the three functions.
Variables used in the design can be stored in memory or registers within the RTL design. Each of these has a limited number of read and write ports connected, and thus a limited amount of throughput to that storage.
An array is no different. Even though it stores multiple pieces of data, the entire structure can only be accessed as one memory location.
This algorithm relies on a Fr constants[...] array that is used in multiple locations of the logic. In order to avoid a bottleneck on this array we need to make it appear as many small memory locations to the RTL.
Note: This #pragma cannot be used on entry function inputs using m_axi ports. The design could be changed to use hls::vector at the interface level. For now we'll stick with an array to explore the #pragma.
Create a new variable and define the #pragma after.
Fr constantsFr[373];
#pragma HLS array_partition variable=constantsFr complete
Specifying complete tells Vitis we want every element in the array to be its own memory location. Other options allow creating smaller arrays by specifying cyclic or block.
Transfer the data from the constants array to the constantsFr array, and call the transfer from our entry function.
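A minimal sketch of that transfer (the helper name load_constants is my own, assuming the Fr struct and the constantsFr array defined above):

static void load_constants(const Fr* constants, Fr constantsFr[373]) {
#pragma HLS inline off
load_constants: for (int i = 0; i < 373; i++) {
        constantsFr[i] = constants[i];
    }
}

poseidon_hash(...) would then call load_constants(constants, constantsFr) before handing constantsFr to the compute function.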
So far we have used C++ data types like unsigned long long and int, but we can get more specific about the size of these variables. In hardware every extra bit represents an extra wire and memory location that must be built into the design. In addition, each extra bit could require using more resources like DSPs (Digital Signal Processing blocks) or LUTs (Look Up Tables) for doing operations on the data.
To be bit specific, switch to using:
- ap_int<N> for signed integers
- ap_uint<N> for unsigned integers
- ap_fixed<W, I, Q, O, N> for signed fixed-point numbers
- ap_ufixed<W, I, Q, O, N> for unsigned fixed-point numbers
A word of caution: we may take for granted int with its billions of possible values and not think about the boundary conditions involved in sizing our variables. As an example, if we have a variable with a maximum value of 16, our first instinct may be to use an ap_uint<4>, which holds 2^4 (or 16) possible values. However, an ap_uint<4> only covers 0 to 15, and we would never be able to represent our maximum of 16. As such we should choose an ap_uint<5>, which holds 2^5 (or 32) possible values, from 0 to 31. If we determine the variable should allow negative values then we need an ap_int<6> to cover -32 to 31.
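A quick sketch of those boundaries:

#include "ap_int.h"
#include <iostream>
int main() {
    ap_uint<4> a = 15; // 4 bits cover 0 to 15
    a = a + 1;         // wraps to 0: 16 does not fit in 4 bits
    ap_uint<5> b = 16; // 5 bits cover 0 to 31
    ap_int<6> c = -32; // signed 6 bits cover -32 to 31
    std::cout << a << " " << b << " " << c << std::endl; // prints: 0 16 -32
    return 0;
}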
Arbitrary Precision Documentation
#include "ap_int.h"
ap_int supports up to 1024-bit numbers (this can be increased to 4096 bits with a configuration change). The standard arithmetic, bit logic, increment/decrement, and relational operators are identical to the C++ operators. Additional bit selection, bit concatenation, bit range selection, and bit manipulation operators are available.
ap_fixed can represent float and double from C++, or various other levels of precision.
See the documentation for further details.
Let's change all of our C++ variable types to ap_int data types and size them to the minimum bits needed.
This struct uses an array of 4 64-bit values due to the limitation of the host only being able to handle up to unsigned long long. The number this struct represents is then val[0] + (val[1] << 64) + (val[2] << 128) + (val[3] << 192).
Our FPGA can handle the full 256-bit number. This simplifies some of the arithmetic operations that previously needed to work on limbs and can now operate on plain numbers. This may also perform better, with fewer cycles and resources.
We still need to handle the host's limitations, so we create a Fr_4Limb and modify the existing Fr to use ap_uint<256>.
We can convert the Fr_4Limb to Fr using the concatenation (see "Concatenation" under C++ Arbitrary Precision Types) operation of ap_uint (the use of parentheses with commas). Notice how the Fr { { , , } } changed to Fr { ( , , , ) }.
Note: we also switch from array format of little endian (least significant data on the left) to number format of big endian (least significant data on the right).
When we store the data for sending back to the host we need to handle the host's limitations again. This time we use range selection (see "Range Selection" under C++ Arbitrary Precision Types) to select the 64 bits of the larger number that make up each limb.
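Both conversions in one hedged sketch (the helper names to_fr and to_4limb are my own, assuming the structs sketched earlier):

// 4-limb little-endian array -> 256-bit number: concatenation puts the most
// significant limb first.
static Fr to_fr(Fr_4Limb p) {
    return Fr { (ap_uint<64>(p.val[3]), ap_uint<64>(p.val[2]),
                 ap_uint<64>(p.val[1]), ap_uint<64>(p.val[0])) };
}
// 256-bit number -> 4 limbs: range selection picks 64 bits at a time.
static Fr_4Limb to_4limb(Fr a) {
    Fr_4Limb out;
    for (int i = 0; i < 4; i++) {
        ap_uint<64> limb = a.val.range(i * 64 + 63, i * 64);
        out.val[i] = limb.to_uint64();
    }
    return out;
}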
Wikipedia - Montgomery Modular Multiplication
Montgomery Modular Multiplication is a performance improvement that avoids the division operation when calculating a*b mod N. Numbers are converted to Montgomery Form before arithmetic operations are applied and then converted back to normal form when the logic is finished.
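Here is a toy example of the idea with small numbers, where N = 97 and R = 2^8 stand in for the real 255-bit modulus and R = 2^256:

#include <stdio.h>
#include <stdint.h>
#define N     97u   // toy modulus
#define RBITS 8u    // R = 2^8
#define RMASK 0xFFu
#define NINV  95u   // -N^-1 mod R, precomputed
// REDC: given T < N*R, returns T * R^-1 mod N using only shifts, masks, and adds.
static uint32_t redc(uint32_t T) {
    uint32_t m = ((T & RMASK) * NINV) & RMASK; // m = (T mod R) * N' mod R
    uint32_t t = (T + m * N) >> RBITS;         // exact division by R
    return (t >= N) ? t - N : t;
}
int main(void) {
    uint32_t a = 42, b = 17;
    uint32_t aR = (a << RBITS) % N;  // convert to Montgomery form (one-time cost)
    uint32_t bR = (b << RBITS) % N;
    uint32_t abR = redc(aR * bR);    // multiply while staying in Montgomery form
    printf("%u (expected %u)\n", redc(abR), (a * b) % N); // prints: 35 (expected 35)
    return 0;
}

Because R is a power of two, the division and modulo by R reduce to shifts and masks; the only real cost is a few multiplications.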
Looking at the original (simplified) OpenCL code we see logic around handling the limbs. In the functions for addition and subtraction we must iterate over the four limbs and carry or borrow along the way.
Fr Fr_add_(Fr a, Fr b) {
bool carry = 0;
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb old = a.val[i];
a.val[i] += b.val[i] + carry;
carry = carry ? old >= a.val[i] : old > a.val[i];
}
return a;
}
Fr Fr_sub_(Fr a, Fr b) {
bool borrow = 0;
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb old = a.val[i];
a.val[i] -= b.val[i] + borrow;
borrow = borrow ? old <= a.val[i] : old < a.val[i];
}
return a;
}
Because our FPGA can handle a 256 bit number we can eliminate this complexity and just add or subtract the numbers.
static Fr Fr_add_(Fr a, Fr b) {
return Fr { a.val + b.val };
}
static Fr Fr_sub_(Fr a, Fr b) {
return Fr { a.val - b.val };
}
Multiplication also gets a little simpler, going from a doubly nested loop over the limbs.
// Modular multiplication
Fr Fr_mul(Fr a, Fr b) {
/* CIOS Montgomery multiplication, inspired from Tolga Acar's thesis:
* https://www.microsoft.com/en-us/research/wp-content/uploads/1998/06/97Acar.pdf
* Learn more:
* https://en.wikipedia.org/wiki/Montgomery_modular_multiplication
* https://alicebob.cryptoland.net/understanding-the-montgomery-reduction-algorithm/
*/
Fr_limb t[Fr_LIMBS + 2] = {0};
for(uchar i = 0; i < Fr_LIMBS; i++) {
Fr_limb carry = 0;
for(uchar j = 0; j < Fr_LIMBS; j++)
t[j] = mac_with_carry_64(a.val[j], b.val[i], t[j], &carry);
t[Fr_LIMBS] = add_with_carry_64(t[Fr_LIMBS], &carry);
t[Fr_LIMBS + 1] = carry;
carry = 0;
Fr_limb m = Fr_INV * t[0];
mac_with_carry_64(m, Fr_P.val[0], t[0], &carry);
for(uchar j = 1; j < Fr_LIMBS; j++)
t[j - 1] = mac_with_carry_64(m, Fr_P.val[j], t[j], &carry);
t[Fr_LIMBS - 1] = add_with_carry_64(t[Fr_LIMBS], &carry);
t[Fr_LIMBS] = t[Fr_LIMBS + 1] + carry;
}
Fr result;
for(uchar i = 0; i < Fr_LIMBS; i++) result.val[i] = t[i];
if(Fr_gte(result, Fr_P)) result = Fr_sub_(result, Fr_P);
return result;
}
To a single loop for our FPGA. We still maintain a notion of multiplication by smaller parts of the number for two reasons:
- The Fr_INV value is directly associated with the LSB (Least Significant Bits) of the number. If we were to re-calculate Fr_INV we could apply it to the full 256-bit number.
- Using smaller numbers in multiplication directly reduces the number of DSPs required to complete the multiplication. Balancing between this form with additional multiplications and a straight 256-bit multiplication (512-bit result) is part of the optimization process.
static ap_uint<384> Fr_mul_part(Fr a, Fr b, ap_uint<384> t384, ap_uint<3> i) {
#pragma HLS inline
//ap_uint<8> range = i * ap_uint<7>(64);
ap_uint<64> bSubRange = b.val.range((i * 64) + 63, i * 64);
t384 += a.val * bSubRange;
ap_uint<64> t_first_limb = t384.range(63, 0);
ap_uint<64> m = Fr_INV * t_first_limb;
ap_uint<129> res = m * Fr_P_first_limb + t384(63, 0);
ap_uint<64> carry = res.range(128, 64);
return m * Fr_P_remainder + t384(383, 64) + carry;
}
static Fr Fr_mul(Fr a, Fr b) {
#pragma HLS inline
ap_uint<384> t384 = ap_uint<384>(0);
Fr_mul_loop: for (ap_uint<3> i = 0; i < 4; i++) {
t384 = Fr_mul_part(a, b, t384, i);
}
Fr result = Fr { (t384(255, 192), t384(191, 128), t384(127, 64), t384(63, 0)) };
if (Fr_gte(result, Fr_P)) {
result = Fr_sub_(result, Fr_P);
}
return result;
}
Add a few examples to the compute portion to see if and how Montgomery Modular Multiplication is working.
static void hash_2_standard(const Fr constants[], hls::stream<state_2_standard> &in, hls::stream<Fr> &out, ap_uint<9> size) {
#pragma HLS inline off
compute: for (ap_uint<9> i = 0; i < size; i++) {
state_2_standard s = in.read();
int testAdd = size * 2;
Fr Fr_one = Fr_mont(Fr { 1 });
cout << "one (in montgomery form) " << Fr_unmont(Fr_two).val << endl;
Fr Fr_two = Fr_add(Fr_one, Fr_one);
cout << "two (in montgomery form) " << Fr_two.val << endl;
cout << "test add 1 + 1 (regular number)" << Fr_unmont(Fr_two).val << endl;
Fr test_multiply = Fr_mul(Fr_two, Fr_two);
cout << "four (in montgomery form) " << test_multiply.val << endl;
cout << "test multiply (regular number " << Fr_unmont(test_multiply).val << endl;
out.write(Fr { ap_uint<256>(1) });
}
}
Analyzing through Vitis HLS shows very little resource (DSP, FF, LUT) usage but we are also not doing the core portion of the algorithm yet.
The functions just added are not shown in the hierarchy. Vitis will automatically inline small functions, and we have specified #pragma HLS inline for the Fr_mul and Fr_mul_part functions. To have these functions show up we can specify #pragma HLS inline off to force Vitis to treat those functions as independent RTL.
First update the host to add a golden data set for testing.
See Project Code section (bottom of this page) "host_poseidon_2.cpp"
The most straightforward implementation of the hashing code is to translate from OpenCL to C++ and apply the lessons we have learned so far.
static Fr quintic_s_box(Fr l, Fr pre_add, Fr post_add) {
Fr tmp = Fr_add(l, pre_add);
tmp = Fr_mul(l, l);
tmp = Fr_mul(tmp, tmp);
tmp = Fr_mul(tmp, l);
tmp = Fr_add(tmp, post_add);
return tmp;
}
static Fr scalar_product(const Fr* a, Fr* b, ap_uint<2> size) {
Fr res = Fr_ZERO;
multiply_add_loop: for (ap_uint<2> i = 0; i < size; ++i) {
Fr tmp = Fr_mul(a[i], b[i]);
res = Fr_add(res, tmp);
}
return res;
}
static state_2_standard apply_matrix_2_standard(const Fr matrix[9], state_2_standard s) {
Fr tmp[3];
initialize: for (ap_uint<2> i = 0; i < 3; ++i) {
tmp[i] = s.elements[i];
s.elements[i] = Fr_ZERO;
}
multiply_add_loop: for (ap_uint<2> j = 0; j < 3; ++j) {
for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[j] = Fr_add(s.elements[j], Fr_mul(matrix[i * 3 + j], tmp[i]));
}
}
return s;
}
static state_2_standard apply_sparse_matrix_2_standard(const Fr sm[5], state_2_standard s) {
Fr first_elt = s.elements[0];
s.elements[0] = scalar_product(sm + 0, s.elements, 3);
multiply_add_loop: for (ap_uint<2> i = 1; i < 3; ++i) {
Fr val = Fr_mul((sm + 3)[i-1], first_elt);
s.elements[i] = Fr_add(s.elements[i], val);
}
return s;
}
static state_2_standard apply_round_matrix_2_standard(const Fr constants[373], state_2_standard s) {
if (s.current_round == 3) {
s = apply_matrix_2_standard((const Fr (*))(constants + 89), s);
} else if ((s.current_round > 3) && (s.current_round < 4 + 55)) {
int index = s.current_round - 3 - 1;
s = apply_sparse_matrix_2_standard(constants + 98 + (index * 5), s);
} else {
s = apply_matrix_2_standard((const Fr (*))(constants + 80), s);
}
return s;
}
static state_2_standard add_full_round_keys_2_standard(const Fr constants[373], state_2_standard s) {
add_loop: for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[i] = Fr_add(s.elements[i], (constants + 1)[s.rk_offset + i]);
}
s.rk_offset += 3;
return s;
}
static state_2_standard full_round_2_standard(const Fr constants[373], state_2_standard s) {
quintic_s_box_loop: for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, (constants + 1)[s.rk_offset + i]);
}
s.rk_offset += 3;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
static state_2_standard last_full_round_2_standard(const Fr constants[373], state_2_standard s) {
quintic_s_box_loop: for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, Fr_ZERO);
}
s = apply_round_matrix_2_standard(constants, s);
return s;
}
static state_2_standard partial_round_2_standard(const Fr constants[373], state_2_standard s) {
s.elements[0] = quintic_s_box(s.elements[0], Fr_ZERO, (constants + 1)[s.rk_offset]);
s.rk_offset += 1;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
static void hash_2_standard(Fr constants[373], hls::stream<state_2_standard> &preimage_stream, hls::stream<Fr> &digests, ap_uint<32> size) {
dataflow_loop:for (ap_uint<32> i = 0; i < size; i++) {
state_2_standard s = preimage_stream.read();
s = add_full_round_keys_2_standard(constants, s);
for (ap_uint<3> i = 0; i < 4; ++i) {
s = full_round_2_standard(constants, s);
}
for (ap_uint<6> i = 0; i < 55; ++ i) {
s = partial_round_2_standard(constants, s);
}
for (ap_uint<2> i = 0; i < (4 - 1); ++ i) {
s = full_round_2_standard(constants, s);
}
s = last_full_round_2_standard(constants, s);
digests << s.elements[1];
}
}
- Dataflow is added to the "compute" method hash_2_standard.
- Add labels to loops.
- Change to ap_uint data types.
Running software emulation passes our golden data set validation, but the true test is how fast and efficient the logic is when we analyze the hardware emulation.
#pragma HLS LOOP_TRIPCOUNT
Quick detour to add a loop tripcount. This is for Vitis HLS analysis only, but it gives information on how many times a loop is called (when it can't be inferred). Vitis can then display latency values where it otherwise wouldn't have known them. I set 1000 as an easy number to analyze. The true loop count is much higher.
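For example (a hypothetical loop; the hint values only affect the analysis reports, not the generated hardware):

#include "ap_int.h"
static void accumulate(const int *data, int *result, ap_uint<32> size) {
    int acc = 0;
accumulate_loop: for (ap_uint<32> i = 0; i < size; i++) {
#pragma HLS loop_tripcount min=1000 max=1000
        acc += data[i]; // size is runtime data, so the bound can't be inferred
    }
    *result = acc;
}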
- The Iteration Latency column says how long it takes to finish one iteration of the logic. This code is taking 55,654 clock cycles per piece of data.
- The Latency (cycles) is how long it takes to finish all data for that function. Given hash_2_standard is our compute function, this is the primary driver for performance. This code will execute 1000 hashes in 55,654,001 cycles. A 300 MHz clock (300,000,000 cycles per second) would mean we finish 1000 hashes in 0.185 seconds. This is pretty slow.
- There are some pipelined functions that Vitis was able to apply automatically.
- There are "II Violations" (iteration interval) due to a carried dependency. A store operation and a load operation are happening at the same time. This will need to be fixed. Clicking on the guidance message will take you to the code location it thinks is the cause. Most times I find it is a layer or more deeper (potentially due to inlining).
The #pragma HLS pipeline directive is critical for gaining the most from our logic.
When we implement a program in an FPGA the logic turns into wires connecting LUTs, FFs, DSPs, and other components throughout the chip. A piece of data only occupies a small section of that hardware at any given time. If we imagine our FPGA is like a factory with an assembly line inside, we want to put new data onto that assembly line as soon as the last data has moved to the next stage. Having many data items being processed along the circuit path at the same time increases efficiency and leads to high parallelism.
The ideal scenario is to achieve an II (Iteration Interval) of 1, which means a new piece of data starts processing down the assembly line every clock cycle. In practice we need to balance parallelism with resource availability.
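A minimal sketch of asking for that on a hypothetical loop (Vitis reports the II it actually achieved; the 128-bit product is truncated to 64 bits here just to keep the example small):

#include "ap_int.h"
static void multiply_arrays(const ap_uint<64> a[64], const ap_uint<64> b[64],
                            ap_uint<64> out[64]) {
multiply_loop: for (int i = 0; i < 64; i++) {
#pragma HLS pipeline II=1
        out[i] = a[i] * b[i]; // goal: issue one new multiplication per clock cycle
    }
}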
With the current code just putting a pipeline at the entry function will only cut our latency in half while requiring 4.3x more DSPs than our FPGA has available.
Start by rethinking the way the code is written so that pipelining it is more efficient.
Reducing Paths
Logic turns into circuit paths; the more variations of the logic we have, the more circuit paths need to be created. This if / else if / else ends up creating three circuit paths through the chip, but a piece of data will only ever fall into one of the paths, leaving the other two idle.
static state_2_standard apply_round_matrix_2_standard(const Fr constants[373], state_2_standard s) {
if (s.current_round == 3) {
s = apply_matrix_2_standard((const Fr (*))(constants + 89), s);
} else if ((s.current_round > 3) && (s.current_round < 4 + 55)) {
int index = s.current_round - 3 - 1;
s = apply_sparse_matrix_2_standard(constants + 98 + (index * 5), s);
} else {
s = apply_matrix_2_standard((const Fr (*))(constants + 80), s);
}
return s;
}
If we rewrite this to combine the apply_matrix_2_standard(...) calls, we can get higher utilization and thus higher efficiency from the circuit.
static state_2_standard apply_round_matrix_2_standard(const Fr constants[373], state_2_standard s) {
if ((s.current_round > 3) && (s.current_round < 4 + 55)) {
ap_uint<7> index = s.current_round - 3 - 1;
ap_uint<9> index2 = index * 5;
s = apply_sparse_matrix_2_standard(constants + 98 + (index2), s);
} else {
Fr *matrix;
if (s.current_round == 3) {
matrix = (Fr (*))(constants + 89);
} else {
matrix = (Fr (*))(constants + 80);
}
s = apply_matrix_2_standard(matrix, s);
}
return s;
}
Write It In A Different Way
hash_2_standard has a few issues:
- Putting a #pragma HLS pipeline at the top of the function causes too high a resource requirement.
- Putting #pragma HLS pipeline at the top of every loop helps to pipeline the loop, but exiting the loop requires the pipeline to empty. Given the dataflow_loop is not pipelined, we only see one piece of data moving down the entire pipeline at any given time.
- We see the duplicate-path inefficiencies with full_round_2_standard showing up in two places, along with last_full_round_2_standard being very similar to a full round.
static void hash_2_standard(Fr constants[373], hls::stream<state_2_standard> &preimage_stream, hls::stream<Fr> &digests, ap_uint<32> size) {
dataflow_loop:for (ap_uint<32> i = 0; i < size; i++) {
#pragma HLS loop_tripcount min=1000 max=1000
state_2_standard s = preimage_stream.read();
s = add_full_round_keys_2_standard(constants, s);
for (ap_uint<3> i = 0; i < 4; ++i) {
s = full_round_2_standard(constants, s);
}
for (ap_uint<6> i = 0; i < 55; ++ i) {
s = partial_round_2_standard(constants, s);
}
for (ap_uint<2> i = 0; i < (4 - 1); ++ i) {
s = full_round_2_standard(constants, s);
}
s = last_full_round_2_standard(constants, s);
digests << s.elements[1];
}
}
We can try to fix the duplication first. The full, partial, and last rounds all look very similar to one another.
static state_2_standard full_round_2_standard(const Fr constants[373], state_2_standard s) {
quintic_s_box_loop: for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, (constants + 1)[s.rk_offset + i]);
}
s.rk_offset += 3;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
static state_2_standard last_full_round_2_standard(const Fr constants[373], state_2_standard s) {
quintic_s_box_loop: for (ap_uint<2> i = 0; i < 3; ++i) {
s.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, Fr_ZERO);
}
s = apply_round_matrix_2_standard(constants, s);
return s;
}
static state_2_standard partial_round_2_standard(const Fr constants[373], state_2_standard s) {
s.elements[0] = quintic_s_box(s.elements[0], Fr_ZERO, (constants + 1)[s.rk_offset]);
s.rk_offset += 1;
s = apply_round_matrix_2_standard(constants, s);
s.current_round += 1;
return s;
}
We can combine this code to get a single function that represents all rounds.
static state_2_standard all_rounds_2_standard(const Fr constants[], state_2_standard s, ap_uint<1> full, ap_uint<1> partial, ap_uint<1> last) {
state_2_standard s2;
s2.current_round = s.current_round;
loop_quintic: for (ap_uint<2> i = 0; i < 3; ++i) {
Fr post_add;
if (full || partial) {
post_add = (constants + 1)[s.rk_offset + i];
} else {
post_add = Fr_ZERO;
}
if (i == 0 || full || last) {
s2.elements[i] = quintic_s_box(s.elements[i], Fr_ZERO, post_add);
} else {
s2.elements[i] = s.elements[i];
}
}
if (full) {
s2.rk_offset = s.rk_offset + 3;
} else if (partial) {
s2.rk_offset = s.rk_offset + 1;
} else {
s2.rk_offset = s.rk_offset;
}
state_2_standard s3 = apply_round_matrix_2_standard(constants, s2);
s3.current_round = s.current_round + 1;
return s3;
}
All rounds flow through the same code with minor circuit branching to determine the unique requirements of each round type. Since we combined this code the pipeline will stay more saturated with data.
Back to hash_2_standard. This function needs to do one of three things:
- Read data from the in stream (consume).
- Compute a single round (out of 63 total rounds).
- Write data to the out stream (produce).
if (round[index] == 0) {
add_full_round_keys_2_standard(constants, in.read());
} else if (round[index] >= 1 && round[index] <= 63) {
all_rounds_2_standard(constants, s[index], full, partial, last);
} else {
out.write(s[index].elements[1]);
}
Add state management to track:
- Previously read data from the stream.
- When data needs to be computed and if that computation is a full, partial, or last round.
- When data can be written to the stream.
Now we can create a clever single loop structure. The main_loop gets more iterations, 65x more to be exact (1 to read, 63 to compute, and 1 to write).
static void hash_2_standard(const Fr constants[], hls::stream<state_2_standard> &in, hls::stream<Fr> &out, ap_uint<32> size) {
state_2_standard s[512];
ap_uint<7> round[512] = {0};
ap_uint<9> index = 0;
ap_uint<32> totalReads = 0;
ap_uint<32> fitToProcessingSize = (size >> 9) << 9;
if (fitToProcessingSize < size) {
fitToProcessingSize += 512;
}
ap_uint<39> processSize = fitToProcessingSize * 65;
main_loop: for (ap_uint<32> i=0; i < processSize; i++) {
#pragma HLS loop_tripcount min=65000 max=65000
#pragma HLS PIPELINE
if (round[index] == 0) {
if (totalReads < size) {
s[index] = add_full_round_keys_2_standard(constants, in.read());
round[index] = 1;
totalReads = totalReads + 1;
}
} else if (round[index] >= 1 && round[index] <= 63) {
ap_uint<7> currentP = round[index];
ap_uint<1> full = ((currentP >= 1 && currentP <= 4) || (currentP >= 60 && currentP <= 62));
ap_uint<1> partial = (currentP >= 5 && currentP <= 59);
ap_uint<1> last = (currentP == 63);
s[index] = all_rounds_2_standard(constants, s[index], full, partial, last);
round[index] = round[index] + 1;
} else {
out.write(s[index].elements[1]);
round[index] = 0;
}
index++;
if (index >= 512) {
index = 0;
}
}
}
The loop's purpose is only to stop processing when all of the data has completed; it doesn't drive any logic. We really created a pseudo loop with the index. The index determines which piece of data in our s[512] array will be processed in that iteration. The round[512] array then keeps state of how many times that data has had a round applied, and is used to determine the type of round to apply next.
Why does the new hash_2_standard function use an array of 512 in size? Couldn't we just pipeline one piece of data 63 times and then write it out?
Data dependencies prevent us from doing another operation on a piece of data before the previous operation has finished.
When all_rounds_2_standard is called it takes 180 to 200 cycles to complete (depending on which optimizations we have applied). We don't want to wait for that data to finish; instead we put other pieces of data down the pipeline. The state array is sized to allow sufficient time for calculations in the pipeline to finish and write back to the array index before the next time we need to read from that array index. This is called a loop-carried dependency, in which a RAW (Read-After-Write) hazard is possible.
The other type of dependency is a loop-independent dependency, where a set of accesses happens multiple times in a loop iteration.
#pragma HLS dependence can be used to provide additional information about dependencies, which may help improve pipelining at a lower II. We didn't need this #pragma because we built in a natural latency to prevent the dependency.
Couldn't we escape early if the cache state wasn't fully used?
If we had an input that didn't fully fill the 512 slots of the array, and we built in some type of early loop wrap to reduce the wasted cycles, we would no longer be guaranteeing that the RAW condition won't be violated.
With large datasets (32 GB) the number of clock cycles potentially wasted on the last unfilled cache is inconsequential to the overall runtime.
Analysis
We are still not fully pipelined. The low level functions didn't pipeline, which means we need to wait that number of cycles for each and every piece of data. The main_loop II got a lot better, but we also increased the trip count by 65x, so the overall latency didn't improve much.
We need to do more optimization, starting in scalar_product. From the guidance:
Inlining allows a function to be combined into a calling function for the purposes of sharing and optimizing the logic with its caller's logic. When a function is inlined, any pragma specified inside it is dissolved and ignored.
We can try inlining the functions that are resisting pipelining in our current code. Specify #pragma HLS inline in:
- scalar_product
- apply_sparse_matrix_2_standard
- apply_round_matrix_2_standard
There is an amazing improvement in performance with an II=1. We are now sending a new piece of data down the pipeline every cycle. If this FPGA were running at 300 MHz we would be calculating 300 million rounds per second; at 65 rounds per piece of data, that is around 4.6 million complete hashes per second.
There is a catch: the balance between parallelism and area has been violated. If we click the "Show Resource Utilization Percentage" button we can see we are using 182% of the FPGA's available DSPs. The DSPs are getting heavily used by apply_matrix_2_standard and quintic_s_box.
Let's focus on area optimizations.
#pragma HLS bind_op
To reduce resource usage we could try moving some of the multiply operations from DSPs to fabric (LUTs). I've seen negative timing slack and timing violations when I've tried this. More exploration is required.
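A sketch of what that experiment might look like (the helper function is my own; bind_op requests a fabric implementation for the multiply):

#include "ap_int.h"
static ap_uint<512> fabric_mul(ap_uint<256> x, ap_uint<256> y) {
    ap_uint<512> prod;
#pragma HLS bind_op variable=prod op=mul impl=fabric
    prod = x * y; // built from LUTs instead of DSP slices
    return prod;
}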
#pragma HLS allocation
To reduce resource usage we need to expand the II and potentially limit certain resource usage within the logic. Apply #pragma HLS allocation as follows:
- quintic_s_box: #pragma HLS allocation operation instances=mul limit=15
- apply_matrix_2_standard: #pragma HLS allocation operation instances=mul limit=30
- all_rounds_2_standard: #pragma HLS allocation operation instances=mul limit=30
We are directing Vitis to reduce some of the allocated circuits and instead reuse the ones it has created. The side effect is a longer II=12.
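The pragma goes inside the body of the function whose multiplies it constrains. A hedged sketch on a hypothetical helper:

#include "ap_int.h"
static ap_uint<256> sum_of_products(const ap_uint<128> a[4], const ap_uint<128> b[4]) {
#pragma HLS allocation operation instances=mul limit=1
    ap_uint<256> acc = 0;
sum_loop: for (int i = 0; i < 4; i++) {
        acc += a[i] * b[i]; // all four multiplies share a single multiplier
    }
    return acc;
}

With limit=1 the four multiplies reuse one multiplier, trading a higher II for less area, which is the same trade we are making above.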
With this optimization we are processing a new round every 12 clock cycles. At 65 rounds per piece of data we finish a hash every 780 clock cycles. If our FPGA is running at 300 MHz then we are processing 384,615 hashes per second.
Pragma Applied For Us
Some pragmas commonly get applied automatically by the compiler. See the #pragma documentation for details.
#pragma HLS aggregate
#pragma HLS expression_balance
#pragma HLS loop_merge
#pragma HLS unroll
The final step is testing on hardware. This compile takes the longest and consumes the most system RAM. Using the run icon when compiled for hardware will physically run your kernel and data on your FPGA.
Finding The Right Balance
It can be a difficult journey getting the correct parallelism vs area. If you want a challenge, see if you can improve on II=12 while still being able to build the RTL for hardware.
In Conclusion
Nice work! We went from code to a fully optimized FPGA implementation.
See Project Code section (bottom of this page) "krnl_poseidon_arity2_final.cpp".