IPFS is a decentralized file system. Files and parts of files are stored on distributed machines. Each file is referenced by a unique hash created from the contents of the file or chunk of a file.
Whole-slide imaging is a technology where an entire pathology slide is digitized and stored. The image may be captured at different zoom levels or focal depths and is typically at very high resolution. The result is that the files are often very large, on the order of tens of gigabytes.
Storing whole-slide images on IPFS has some advantages. The hashes used as indices for the image chunks act as unique fingerprints and prevent file tampering. The image is partitioned into chunks, so requesting software can retrieve only the chunks it needs, reducing bandwidth requirements and latency. IPFS also caches chunks on the local machine so that image chunks don't need to be retrieved multiple times.
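As a rough illustration (assuming the go-ipfs/Kubo command-line client; the directory name, chunk size, and CID below are made-up placeholders), a slide can be added with an explicit chunk size and any individual block later fetched by its CID:

ipfs add -r --chunker=size-1048576 slide_0001/
ipfs block get <block CID> > chunk.bin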
It is important to determine the computationally intensive tasks involved in chunking up whole-slide images for use on IPFS. Here is a list of some of the most intensive tasks:
1. Hashing (SHA256) of chunks to determine each chunk's unique content identifier (CID); see the sketch after this list.
https://en.wikipedia.org/wiki/SHA-2
2. Hashing (SHA256) of chunks to verify the authenticity of a given chunk.
3. Whole-slide images are often compressed to save disk space. Zarr https://zarr.readthedocs.io/en/stable/index.html is a commonly used format for whole-slide image data and supports compression codecs like Blosc, Zstandard, LZ4, Zlib, BZ2 and LZMA. Any and all of these compression types can be accelerated for both compression and decompression.
4. Files in IPFS often use Protocol Buffers https://developers.google.com/protocol-buffers to self-describe and encode the data. Translating raw data to and from Protocol Buffers is computationally expensive.
5. Large sets of data, or even directories, are encapsulated in Merkle DAGs. https://docs.ipfs.io/concepts/merkle-dag/ These create a tree of data that allows the integrity of everything in the set to be checked.
6. IPFS represents numbers in several different bases. Humans prefer base 10, computers prefer base 2, and computer engineers base 16, but hashes are often represented in less common bases like Base58 https://tools.ietf.org/id/draft-msporny-base58-01.html. Translating hashes from one base to another for large datasets can be computationally expensive.
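To make items 1 and 2 concrete, here is a minimal sketch of hashing one chunk with OpenSSL's SHA-256 (the chunk size and contents are made up; IPFS then wraps the 32-byte digest in a multihash to form the CID):

// Hash a single chunk with SHA-256; build with -lcrypto.
#include <openssl/sha.h>
#include <cstdint>
#include <cstdio>
#include <vector>

int main() {
    std::vector<uint8_t> chunk(262144, 0xAB);    // a hypothetical 256 KiB image chunk
    uint8_t digest[SHA256_DIGEST_LENGTH];        // 32-byte (256-bit) digest
    SHA256(chunk.data(), chunk.size(), digest);  // one digest per chunk
    for (uint8_t b : digest) std::printf("%02x", b);
    std::printf("\n");
}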
Xilinx has created a family of cards for compute acceleration called Alveo https://www.xilinx.com/products/boards-and-kits/alveo.html, and the Varium C1100 belongs to that family. Each of these cards has a PCIe interface for host communication, a large FPGA, and some memory; on the C1100 that memory is in-package HBM (High Bandwidth Memory). Xilinx also provides tools for developing kernels and the software to run them: Vivado for hardware development and Vitis for software development. Xilinx also provides a programming library for host development called XRT.
For this project, I focused on creating a kernel to address one of the targeted accelerations, converting base16 to base58.
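Before diving into the hardware, here is a small software reference model for the conversion itself: treat the 256-bit hash as a big-endian number, repeatedly divide it by 58, and map the remainders through the base58 alphabet. This is my own sketch (the function and variable names are mine, not the project's), and it mirrors the divider-plus-lookup-table structure described later for the RTL.

// Reference model: big-endian bytes (e.g. a 256-bit hash) -> base58 string.
#include <algorithm>
#include <cstdint>
#include <iostream>
#include <string>
#include <vector>

static const char* kBase58Alphabet =
    "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz";

std::string base58_encode(const std::vector<uint8_t>& num) {
    size_t leading_zeros = 0;
    while (leading_zeros < num.size() && num[leading_zeros] == 0) ++leading_zeros;

    // Repeatedly divide the big number by 58; each remainder becomes one output symbol.
    std::vector<uint8_t> digits(num.begin() + leading_zeros, num.end());
    std::string out;
    while (!digits.empty()) {
        uint32_t rem = 0;
        std::vector<uint8_t> quotient;
        for (uint8_t byte : digits) {
            uint32_t acc = (rem << 8) | byte;           // bring down the next base-256 digit
            uint8_t q = static_cast<uint8_t>(acc / 58);
            rem = acc % 58;
            if (!quotient.empty() || q != 0) quotient.push_back(q);
        }
        out.push_back(kBase58Alphabet[rem]);            // lookup table to ASCII, least significant first
        digits = std::move(quotient);
    }
    out.append(leading_zeros, '1');                     // each leading zero byte encodes as '1'
    std::reverse(out.begin(), out.end());
    return out;
}

int main() {
    std::vector<uint8_t> hash(32, 0x5A);                // a hypothetical 256-bit hash
    std::cout << base58_encode(hash) << "\n";
}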
Let's look at the steps along the way.
1. Card Installation
The physical installation was no problem. The card was installed in a Dell workstation that was designed for high-power PCIe cards. The workstation provided an auxiliary PCIe power cable, and adequate cooling was provided by the fans already in the case.
The card, once installed, can be verified with lspci:
jallred@fpgadev:~$ lspci | grep Xilinx
0000:17:00.0 Processing accelerators: Xilinx Corporation Device 5058
0000:17:00.1 Processing accelerators: Xilinx Corporation Device 5059
At this point, the card needed to have some firmware updates. A couple of reboots later and everything was looking good.
Then the card can be checked with:
xbutil validate
This drives the card through its paces of host communication and reading and writing HBM.
2. Creating a kernel
Xilinx provides a number of tools for creating kernels. Some are based on high-level languages like C. I'm an RTL person, so I prefer to write my kernels in RTL (VHDL).
Writing a kernel in RTL can be done in Vivado, whose RTL Kernel Wizard provides a template to get you started. The template is in Verilog, but Vivado allows for mixed languages, so I kept the standard Verilog template I was given and added my VHDL.
The kernel's inputs and outputs come down to two basic connections. The first is an AXI-Lite interface that lets the host talk to the kernel's register set: the host uses it to tell the kernel to run, to read the kernel's status, and to pass parameters like memory pointers and sizes.
The second interface is another AXI interface that allows the kernel to read and write the HBM.
When the kernel is "linked", the tools automatically connect the AXI-Lite register interface of any and all linked kernels.
A hard HBM controller in the FPGA allows for up to 32 (sometimes 16) AXI interface connections from the logic to the controller. The linker can be directed as to how the kernel HBM interfaces are attached to the HBM controller, as shown below. The host maintains one of the HBM connections, but the different kernels can be connected to different AXI ports as appropriate.
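For example, during linking, each kernel AXI master can be steered to specific HBM pseudo-channels with v++'s --connectivity.sp option (the compute unit and port names here are placeholders):

v++ --link ... --connectivity.sp base58_kernel_1.m00_axi:HBM[0]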
The RTL Kernel Wizard provides a Verilog module that implements the registers readable and writable over the AXI-Lite connection. It is easy to modify that file to add additional parameters as needed. An XML file defines these registers so that host tools can identify them by name, although in most cases the software engineer already knows the registers and can read and write them without needing to interrogate the XML.
I then added a module to hold my custom kernel.
This module connects a clock and reset, the AXI bus to the HBM, and the standard control and status bits used to run the kernel.
For my specific kernel (base16 to base58), the hashes are 256 bits and are loaded by the host into HBM. I then created a state machine that reads the hashes from HBM, processes them, and returns the results to HBM.
It is important to note that the HBM has a 512 bit data bus. That means that every AXI beat reads 512 bits, or two hashes from HBM. Additionally, the HBM AXI bus allows for burst reads up to 64 beats. A single AXI burst can read 128 hashes from HBM.
An HBM burst was used as the basis for the algorithm. A burst request is made, followed by the read of the data, which goes into a couple of dividers and other logic that do the work of converting to base58. At the end, a lookup table converts the results to the ASCII characters used in base58. The results are written to HBM and the kernel status is changed to ap_done = 1. Base58 can produce a variable number of bytes for a given 256-bit hash, so the time to run the algorithm can vary a little. A worst-case scenario resulted in 81920 clocks to calculate the 128 hashes. That's 640 clocks per hash, or about 1.25 microseconds per hash (implying a kernel clock of roughly 500 MHz). If necessary, the xclbin file could be linked with up to 32 of these kernels. That's about 25 million hashes converted per second.
3. Host software
Xilinx provides the XRT API. There are actually three APIs: OpenCL, native C++, and native C. With the native C++ API it is easy to load the xclbin file (the kernel), allocate buffers in HBM, transfer data to and from HBM, read and write registers in the kernel, and start a kernel and check its status.
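Here is a minimal sketch of that flow using the native C++ API. The xclbin name, kernel name, argument order, and buffer sizes are placeholders for illustration, not the project's actual code.

// Minimal XRT native C++ host flow: program the card, move data to HBM, run a kernel.
#include "xrt/xrt_bo.h"
#include "xrt/xrt_device.h"
#include "xrt/xrt_kernel.h"
#include <cstdint>
#include <vector>

int main() {
    xrt::device device(0);                                    // open the first card
    auto uuid = device.load_xclbin("a.xclbin");               // program the FPGA
    auto kernel = xrt::kernel(device, uuid, "base58_kernel"); // look up the kernel by name

    const size_t num_hashes = 128;
    const size_t in_bytes   = num_hashes * 32;                // 256-bit hashes in
    const size_t out_bytes  = num_hashes * 64;                // room for the base58 results

    // Allocate buffers in the HBM banks that the kernel arguments are connected to.
    auto in_bo  = xrt::bo(device, in_bytes,  kernel.group_id(0));
    auto out_bo = xrt::bo(device, out_bytes, kernel.group_id(1));

    std::vector<uint8_t> hashes(in_bytes, 0x5A);              // hypothetical input data
    in_bo.write(hashes.data());
    in_bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);                     // host -> HBM

    auto run = kernel(in_bo, out_bo, static_cast<uint32_t>(num_hashes));
    run.wait();                                               // block until ap_done

    out_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);                  // HBM -> host
    std::vector<uint8_t> results(out_bytes);
    out_bo.read(results.data());
}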
Compilation uses the standard Linux gcc toolchain with library and header files provided by Xilinx.
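With XRT installed, a typical compile line looks something like this (assuming the host source file is named host.cpp):

g++ -std=c++17 -o host host.cpp -I$XILINX_XRT/include -L$XILINX_XRT/lib -lxrt_coreutil -pthread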
4. Debug
Host software can be debugged using all the standard debugging tools for C++. You can use GDB or resort to printing a bunch of stuff out.
For kernel debug, there are two basic options.
The first is to debug the AXI interfaces to the kernel. This one is pretty easy: when using v++ to link the kernels, simply add --debug.chipscope <compute_unit_name>:<interface_name>. This works for both the AXI-Lite interface to the registers and the AXI interface to HBM, and it works really well for making sure the registers are built correctly or that the kernel is accessing HBM correctly.
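For example, to watch the HBM-side AXI interface of a single compute unit during linking (the compute unit and interface names below are placeholders):

v++ -t hw -g --platform xilinx_u55n_gen3x4_xdma_1_202110_1 --link first.xo --debug.chipscope base58_kernel_1:m00_axi -o a.xclbin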
Doing a generic ChipScope debug of a kernel is not so straightforward. First, instantiate an ILA in the kernel using the IP catalog, assign the signals you want to watch to the ILA, and generate the RTL kernel.
Getting the ILA attached to the Debug Bridge in the static portion of the shell needs some XDC hand-holding:
connect_debug_cores -master [get_cells -hierarchical -filter {NAME =~ *debug_bridge_xsdbm/inst/xsdbm}] -slaves [get_cells -hierarchical -filter {NAME =~ */i_ila_1}]
i_ila_1 is the instantiated ILA. The connect_debug_cores command needs to be in an XDC file that is included from a TCL script:
read_xdc /home/jallred/kernels/connect_debug_core.xdc
set_property used_in_implementation TRUE [get_files /home/jallred/kernels/connect_debug_core.xdc]
set_property PROCESSING_ORDER EARLY [get_files /home/jallred/kernels/connect_debug_core.xdc]
The TCL script then needs to be passed to v++:
v++ -t hw -g --platform xilinx_u55n_gen3x4_xdma_1_202110_1 --link first.xo --advanced.param compiler.userPostDebugProfileOverlayTcl=./post_dbg_profile_overlay.tcl -o a.xclbin
Conclusion
I had hoped to develop a number of kernels related to IPFS and whole-slide imaging. Unfortunately, the time with the card was limited and the time to learn the tools was significant. I've shown that the Varium C1100 card is up to the task of accelerating algorithms that are applicable to whole-slide imaging.