Below is a brief summary of what is required to accelerate the cryptographic hashing that is an integral part of the Solana proof-of-history algorithm. The accelerator project is ongoing and the build process is not yet fully automated, so it is necessary to follow the steps described below.
Different flavours and versions of Linux may have different dependency requirements. My development and deployment process is based on the Debian testing distro, which is not officially supported by Xilinx. I had to fix some minor issues with the Xilinx DMA drivers to make them work with kernel 5.16.0. Those fixes have not yet been fully tested. Even so, given how narrow the range of supported configurations is for XRT (in particular its incompatibility with the CUDA toolkit) and how unstable the XRT kernel drivers are on unsupported systems, I made a design choice not to use XRT, in order to enable wider adoption of this accelerator solution.
Why accelerate Solana
Solana is a mainstream proof-of-stake blockchain offering high transaction throughput and low-cost transactions. Solana uses the SHA-256 hash function to provide a decentralised clock for its network in a protocol called proof of history. Following this protocol, the hash function is applied iteratively to a previous checkpoint hash tens of thousands of times per second to obtain the next checkpoint hash. Apart from generating checkpoint hashes, validators also have to verify hashes generated by thousands of other validators.
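To make the hash chain concrete, here is a minimal Rust sketch of the iterative hashing that proof of history performs, using the sha2 crate. The starting bytes and the number of iterations are made up for illustration, and the real validator code in the Solana repository is considerably more involved.
use sha2::{Digest, Sha256};

fn main() {
    // Hypothetical starting checkpoint hash; in practice this comes from the ledger.
    let mut hash = Sha256::digest(b"previous checkpoint");

    // Apply SHA-256 iteratively; a validator does this tens of thousands of times per second.
    let num_hashes = 10_000u64;
    for _ in 0..num_hashes {
        hash = Sha256::digest(&hash);
    }

    // Print the resulting checkpoint hash as hex.
    let hex: String = hash.iter().map(|b| format!("{:02x}", b)).collect();
    println!("checkpoint after {} hashes: {}", num_hashes, hex);
}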
The demands of a growing network call for higher hash rates at validator nodes and for lower computation latency. High-end processors such as the AMD Epyc and Threadripper Pro, with their large caches, offer native SHA-256 instructions (sha_ni), but on larger workloads they do not provide enough throughput. This has been addressed by the Solana performance libraries, in particular on Nvidia CUDA GPUs and on processors with SIMD extensions. Even so, there have been denial-of-service incidents in which the network crashed under transaction load it could not handle, most notably a 17-hour outage.
Given the above, FPGAs offer a novel approach to proof-of-stake blockchain acceleration. Accelerating SHA-256 hashing is only the first step.
Cooling
Passive cooling is insufficient to sustain heavy workloads, yet the card ships without active cooling, so in most cases a third-party or DIY cooling solution is required. This has to be in place before the card is installed.
My current DIY solution is the AC Infinity 5V USB blower fan with a PCI 180-degree angle connector:
At only 17 dB of noise, it keeps average temperatures at 42 °C on the FPGA and 37 °C on the HBM.
A few days before writing this report, a third-party solution became available: Osprey active air cooling.
Software components
You will need:
- Vivado and Vitis HLS -- for RTL implementation, bitstream generation and FPGA programming.
- XDMA driver -- for sending data to the card and for receiving computation results from it.
- Warp Shell -- the baseline PCIe FPGA design, an alternative to the XRT shell that is extendable in Vivado.
- Solana, or more precisely its experimental branch with FPGA support.
- Varana -- Varium C1100 accelerator for Solana.
This guide assumes an existing Vivado installation that includes Vitis HLS. The installation procedure is explained on the Xilinx website. I used version 2021.2. Later versions may require updating a few commands.
Install and fine-tune the XDMA driver
- Clone the Xilinx DMA driver repo.
- If working with kernel 5.16 or newer, you may want to check the status of this pull request.
- Build and install the xdma.ko kernel module.
- Clone the Warp Shell repo.
- Run a script to allow user access to the XDMA driver:
sudo sw/scripts/set-xdma-perms.sh
- Make sure that your user is a member of the dialout group for the above script to take effect.
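As a quick sanity check that the driver is loaded and the permissions took effect, a small Rust snippet like the one below can try to open the XDMA character devices. The device names follow the usual XDMA convention (/dev/xdma0_user, /dev/xdma0_h2c_0, /dev/xdma0_c2h_0) but may differ on your system depending on how the driver was configured.
use std::fs::OpenOptions;

fn main() {
    // Typical XDMA device nodes; adjust the names if your setup differs.
    let devices = ["/dev/xdma0_user", "/dev/xdma0_h2c_0", "/dev/xdma0_c2h_0"];

    for dev in devices {
        // Opening for read/write fails if the driver is missing or permissions are wrong.
        match OpenOptions::new().read(true).write(true).open(dev) {
            Ok(_) => println!("{dev}: accessible"),
            Err(e) => println!("{dev}: NOT accessible ({e})"),
        }
    }
}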
- Time to start the Vivado project. Still in the Warp Shell repo, we need to initialise the baseline project that we are going to extend with a proof-of-history core further down.
mkdir proj
cd proj
vivado -mode batch -source ../hw/scripts/varium_c1100_xdma_gen4_x8_custom.tcl
- Inspect the baseline project with Vivado.
vivado varium_c1100_xdma_gen4_x8_custom/varium_c1100_xdma_gen4_x8_custom.xpr
Now we have a fully working FPGA project. Try to generate the bitstream and program the card. After programming, soft-reboot the host PC. Do not cold-reboot, since that would erase the image you just programmed.
The next steps are to install the Rust programming language and to try the examples and benchmarks as explained in the warp-devices crate readme. The examples and benchmarks work with the baseline FPGA design. For example, you will see DMA read/write performance figures and will be able to access card info and monitoring values such as temperatures and voltages.
Build the Solana accelerator FPGA design
- Clone the Varana repo.
- Synthesise and export the POH design:
cmake -S . -B build
cd build
make csynth_poh_f450
faketime -f '-1y' make export_poh_f450
This may take a few hours depending on your CPU: Vitis HLS performs a lot of repetitive tasks in a single thread and does not parallelise the build process effectively. The faketime wrapper on the export step works around the timestamp overflow that breaks IP export in the 2021.x Xilinx tools.
- In Vivado, add this build directory to the IP catalog. This will make our IP core called Poh available.
- Add the IP core Poh to the baseline design and connect it as follows:
- Instruct the address editor to assign memory addresses automatically. This should result in the following assignment:
- Implement the design and generate the bitstream. Leave Vivado running for several hours to complete this; it might be a good idea to run this process overnight.
- Program the FPGA. Perform a soft reboot of the host.
I have started a proof-of-concept Rust crate to support the Varium C1100. One way to use it is to run the Solana POH benchmark with the new flag that enables FPGA benchmarking:
git clone https://github.com/vkomenda/solana.git
cd solana
git checkout perf-libs-research
cargo run --package solana-poh-bench -- --fpga
The crate is an example of how IP cores can be supported by Rust code on Warp Shell. The device support module poh_core.rs provides a trait interface, PohCoreOps, that allows the core to be initialised and run. Sending inputs to the core and receiving outputs from it is currently done outside this interface, but this is likely to change as the interface becomes more complete.
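As an illustration only, a trait of that shape might look roughly like the sketch below. The method names and the error type are hypothetical; consult the actual PohCoreOps definition in warp-devices for the real interface.
/// Hypothetical sketch of a control interface for the POH core.
/// The real PohCoreOps trait in warp-devices differs in detail.
pub trait PohCoreOps {
    type Error;

    /// Bring the core into a known IDLE state and program static parameters.
    fn init(&mut self) -> Result<(), Self::Error>;

    /// Issue the START command and block until the core reports DONE.
    fn run(&mut self) -> Result<(), Self::Error>;
}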
Another notable aspect is the use of a memory-mapped AXI interface in the POH core, as opposed to a streaming AXI interface. The memory-mapped interface allows the high-bandwidth memory (HBM) tightly coupled with the FPGA to act as a DMA buffer for POH core inputs and outputs. With this design, the following needs to happen to compute proofs of history:
- The host prepares the input vectors for the card and places them in buffers aligned to the DMA page size.
- The host waits until the POH core becomes IDLE.
- The host uses the register-based control interface to set up input parameters.
- The host initiates a DMA transfer of the HBM input buffers to the FPGA.
- The host sends the START command to the POH core.
- The core reads inputs from the HBM, computes results and places them in the output buffer on the HBM.
- The host waits until the core reaches the DONE state.
- The host initiates a DMA transfer of the outputs from the HBM output buffer.
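Put together, the host-side sequence looks roughly like the Rust sketch below. Everything here is a mock that only shows the order of operations: the buffer sizes, HBM addresses and the DmaChannel/PohCore types are invented for illustration and do not match the warp-devices API.
/// Illustrative stand-ins for the real driver objects; the actual types
/// live in the warp-devices crate and have different names and signatures.
struct DmaChannel;
struct PohCore;

impl DmaChannel {
    /// Pretend to DMA a host buffer into the card's HBM at `hbm_addr`.
    fn write(&mut self, hbm_addr: u64, data: &[u8]) {
        println!("DMA write: {} bytes to HBM @ {:#x}", data.len(), hbm_addr);
    }
    /// Pretend to DMA a result buffer out of the card's HBM.
    fn read(&mut self, hbm_addr: u64, data: &mut [u8]) {
        println!("DMA read: {} bytes from HBM @ {:#x}", data.len(), hbm_addr);
    }
}

impl PohCore {
    fn wait_idle(&self) { println!("core is IDLE"); }
    fn set_params(&mut self, num_hashes: u64) { println!("num_hashes = {num_hashes}"); }
    fn start(&mut self) { println!("START issued"); }
    fn wait_done(&self) { println!("core is DONE"); }
}

fn main() {
    // Hypothetical HBM layout and DMA page size, chosen only for illustration.
    const DMA_PAGE: usize = 4096;
    const IN_ADDR: u64 = 0x0000_0000;
    const OUT_ADDR: u64 = 0x1000_0000;

    let mut dma = DmaChannel;
    let mut core = PohCore;

    // 1. Prepare the input vectors in buffers sized to the DMA page size.
    let inputs = vec![0u8; DMA_PAGE];
    let mut outputs = vec![0u8; DMA_PAGE];

    // 2-3. Wait for IDLE, then set input parameters over the register interface.
    core.wait_idle();
    core.set_params(10_000);

    // 4. Transfer the inputs into the HBM input buffer.
    dma.write(IN_ADDR, &inputs);

    // 5-6. Start the core; it reads from HBM, computes, and writes results back to HBM.
    core.start();

    // 7. Wait for the DONE state.
    core.wait_done();

    // 8. Transfer the outputs from the HBM output buffer back to the host.
    dma.read(OUT_ADDR, &mut outputs);
}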
This design was straightforward to implement in Vitis HLS. However, the HBM introduces I/O latency. Ideally, a streaming interface would reduce the latency between the host and the card by eliminating the costly HBM read/write cycles. The streaming AXI primitives in Vitis HLS are, however, more suitable for SIMD-style operations and do not allow detachable threads in the design. This is why an alternative to Vitis HLS called TAPA is a more likely vehicle for continuing the research into streaming interfaces: it does allow detachable threads, and its compile times are orders of magnitude faster.
Acknowledgements
I'm grateful to Quarky for starting Warp Shell. I'm also indebted to other users of the Hackster Adaptive Computing Challenge and FPGA Discord groups for advice and encouragement, and to the contest team at Hackster for their support.