Given the rise of hardware acceleration in FPGA designs for applications such as machine learning and artificial intelligence, I thought it would be a good time to peel back a few layers of the onion and talk about the basics of passing data back and forth between the HDL code running in the programmable logic (PL) of an FPGA and the corresponding software running on a physical or soft processor in the FPGA.
Hardware acceleration can be boiled down to the basic idea of implementing certain functions in hardware (aka the programmable logic of an FPGA) that were previously being run in software, either on a host PC or on a processor within the FPGA. Therefore, a solid understanding of how to pass data back and forth between hardware and software is required to be an efficient designer.
In this particular case, I'm using a Zynq SoC (System on Chip) FPGA that has a physical ARM-core processor built into the same chip alongside the programmable logic (it's hard silicon, not something instantiated in the fabric). This ARM core and its supporting hardware are referred to as the processing system or PS.
While there are a few different ways to accomplish a data transfer between the PL and PS, including writing your own custom interface, I would argue that the most common mechanism is via a direct memory access (DMA) transfer. This is because DMA allows for the CPU of the ARM-core to simply initiate a data transfer between itself and DDR without the CPU having to wait for the transfer to complete before performing any other tasks. DMA also allows for the CPU to initiate a transfer between an external device and the DDR.
In this project, I'm demonstrating the functionality of DMA in the most basic way possible by using the Xilinx DMA IP block that converts a memory-mapped interface to a streaming interface via AXIS buses. I'm writing a series of 32 bytes to memory in embedded C, then transferring it to the PL via the memory map to stream (MM2S) AXIS port, running each value through a register, then streaming the data back to memory via the stream to memory map (S2MM) port of the DMA IP block.
While this example is far too simple for heavy-duty hardware acceleration applications, that level of high speed data transfer can get very complex/overwhelming to learn when new to FPGAs. I felt like this project is a much needed introduction to the use of DMA and its behavior. And while I intended on this project focusing more on the aspect of data processing, I found enough little "gotchas" in the DMA transaction implementation that I'll have to leave the data processing focus to another project.
There are two main layers to controlling the data transfer between the HDL in PL and C code in the PS using AXI DMA:
1. The AXI stream handshaking signals in the HDL code of the PL on the Memory Map to Stream (MM2S) and Stream to Memory Map (S2MM) channels (The control channels of the DMA are written to using plain AXI, but this is all handled automatically by Vivado so I'm only focusing on the AXI stream interface here).
2. The sequence of register reads/writes to the DMA in the C code of the PS.
AXI Stream Handshaking in Verilog
The AXI stream interface is a straightforward set of handshaking signals used for data exchange in embedded designs. There are many optional signals in the AXI stream interface but the relevant and required ones for DMA MM2S and S2MM data exchanges are tdata, tvalid, tready, tlast, and tkeep. AXI stream refers to the entity sending data as the master and the entity receiving data as the slave in the interface.
- tdata: the data bus
- tvalid: asserted by the master interface when the data it has placed on the tdata bus is valid
- tready: asserted by the slave when it is in a state ready to receive data on the tdata bus
- tlast: asserted by the master for the duration of the last packet in the stream on the tdata bus to tell the slave no data will be following that packet
- tkeep: a secondary validation of the packets on the tdata bus set by the master indicating whether a packet is a part of the stream or not
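To make tlast and tkeep a bit more concrete, here's a minimal sketch in plain C (a host-side model, not HDL) of how many data words a transfer takes and what tkeep would look like on the final word. It assumes tkeep carries one bit per byte lane of tdata with the low lanes first, and the 32-bit (4-byte) bus width used in the usage note is my assumption, since it depends on how the DMA IP is configured. The function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Number of tdata words needed to move total_bytes over a bus that is
 * bus_bytes wide. tlast would be asserted on the final one. */
static uint32_t axis_beats(uint32_t total_bytes, uint32_t bus_bytes)
{
    assert(bus_bytes > 0u);
    return (total_bytes + bus_bytes - 1u) / bus_bytes;
}

/* tkeep value for the final word: one bit per valid byte lane, low lanes
 * first. Assumes bus_bytes < 32 so the shift below stays in range. */
static uint32_t axis_last_tkeep(uint32_t total_bytes, uint32_t bus_bytes)
{
    assert(total_bytes > 0u && bus_bytes > 0u && bus_bytes < 32u);
    uint32_t rem = total_bytes % bus_bytes;
    uint32_t valid_lanes = (rem == 0u) ? bus_bytes : rem;
    return (1u << valid_lanes) - 1u;
}
```

With a 4-byte bus, `axis_beats(32, 4)` gives 8 words and `axis_last_tkeep(32, 4)` gives 0xF (all four byte lanes valid), while a 30-byte transfer would still take 8 words but end with a tkeep of 0x3.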
Exactly how the AXI DMA IP implements this handshaking interface for data transfers out of memory (MM2S) and into memory (S2MM) is quite fickle especially on the S2MM side... and that's the nicest term that doesn't contain any curse words in my mind at the moment.
What you need to know about the S2MM transaction of the AXI DMA can however be mostly summed up into a single sentence that should be explicitly stated in its user guide but instead is kinda hard to extrapolate from it: the S2MM transaction must be set up and kicked off by writing to the appropriate control registers in the DMA in the appropriate order before you attempt to send any data to it, and the S2MM channel will stop the transaction once it sees the tlast signal.
Data transfers happen on the tdata bus in S2MM and MM2S transactions on each clock cycle where both tready and tvalid are asserted (true). So when you're on the master side of the AXI stream interface and in charge of asserting tvalid, you have to be careful not to leave tvalid asserted with the same data word on the tdata bus for more than one clock cycle in which the incoming tready signal from the slave is also asserted. Otherwise the slave will clock in the same data packet twice as two separate data packets. And because you have to specify how many bytes are in a transfer in the control registers, the DMA channel (S2MM in this case) will think the exchange is over before it sees the tlast signal, since the count is now off, and it will hang.
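The handshake rule is easier to see with a tiny model. This is a rough sketch in plain C (not the Verilog from this project) that walks through a list of clock cycles and records a word every cycle where tvalid and tready are both high, which is how the slave side sees it. The `axis_cycle_t` struct and `axis_accepted_words` function are hypothetical names I'm using just for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The stream signals as sampled on one rising clock edge. */
typedef struct {
    uint32_t tdata;
    int tvalid;
    int tready;
} axis_cycle_t;

/* Store every word the slave would accept (tvalid && tready in the same
 * cycle) into out, up to out_cap entries, and return how many transfers
 * occurred in total. */
static size_t axis_accepted_words(const axis_cycle_t *cycles, size_t n_cycles,
                                  uint32_t *out, size_t out_cap)
{
    assert(cycles != NULL && out != NULL);
    size_t count = 0;
    for (size_t i = 0; i < n_cycles; i++) {
        if (cycles[i].tvalid && cycles[i].tready) {
            if (count < out_cap) {
                out[count] = cycles[i].tdata;
            }
            count++;
        }
    }
    return count;
}
```

If a master holds 0xA on tdata with tvalid high for two cycles where tready is also high before moving on to 0xB, this model counts three transfers (0xA, 0xA, 0xB) rather than two, which is exactly the duplicate-word situation that throws off the byte count.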
I wrote a simple state machine in Verilog that implements a slave AXI stream interface to receive data from the MM2S channel of the DMA, pass each data packet in the stream through a register, then implements a master AXI stream interface to send the data stream back to the S2MM channel. The register the data from the tdata bus is being passed through is meant to serve as the placeholder for where any custom data processing would happen for hardware acceleration.
I took a screenshot from the ILAs in Vivado to show the timing diagram I implemented with the state machine. The top AXI stream is the MM2S side and the bottom is the S2MM side.
Here's the flow diagram of the Verilog state machine and the actual file is attached to the project at the end. It's important to note that the master/slave interfaces in the flow diagram are from the perspective of my Verilog state machine. This is something that I can get myself turned around on easily.
I simply created a block design in a new Vivado project targeting one of my Zynq-based FPGA development boards, and added an AXI DMA IP block after adding the Zynq PS IP and running block automation to apply the board presets for my development board. The Zynq PS also needs one of the high performance AXI ports enabled under PS-PL Configuration > HP Slave AXI Interface > S AXI HP0 interface for the DMA to be able to access the DDR.
Block automation and connection automation take care of all of the connections between the Zynq IP and DMA IP.
For the specific settings of the DMA IP, I unchecked the option for scatter gather so I'm using the DMA in direct register mode. Then I left everything else as the default settings and checked the option to allow unaligned transfers which I've found gives me a bit more wiggle room when writing a custom AXI stream interface to the DMA.
To add my Verilog state machine to the block design, I right-clicked in a blank area of the block design and selected the Add Module... option, which will show you all of the valid Verilog modules Vivado can find in your design source files to use in the block design.
It's important to note that my signal naming convention follows the standard of "s_axis" and "m_axis" for the slave and master interfaces respectively. This is important for the block design to be able to detect that it is indeed an AXI stream interface and allow me to connect it to the AXI stream ports on the DMA IP block.
I've covered the basic operation of DMA in the past from the perspective of using it from the Linux user-space, but this time I'm using it at a lower level directly from HDL in the programmable logic and bare metal embedded C. So here's the more straightforward sequence when using DMA from a bare metal user-space:
1. Reset the DMA by writing a 1 to bit 2 of the MM2S (offset 0x00) and S2MM (offset 0x30) control registers.
2. Write the destination address of the location in the DDR the S2MM channel is to write data to, to the S2MM DMA destination address register (offset 0x48).
3. Start the DMA S2MM channel by writing a 1 to bit 0 of the S2MM control register (offset 0x30).
4. Write the length of the buffer for the S2MM channel by writing the total number of bytes to read into memory on the S2MM channel to the S2MM buffer length register (offset 0x58). This kicks off the S2MM transfer such that the DMA is prepared to receive a data stream from a device in the FPGA logic (data doesn't actually start moving until that device puts data on the AXI stream bus and asserts tvalid).
5. Write the source address in the DDR of the data the MM2S channel is to read from to the MM2S DMA source address register (offset 0x18).
6. Start the DMA MM2S channel by writing a 1 to bit 0 of the MM2S control register (offset 0x00).
7. Write the length of the transfer from the MM2S channel by writing the value for the total number of bytes to send out to the MM2S transfer length register (offset 0x28). This kicks off the MM2S transfer from the DMA to the receiving device in the FPGA logic.
Remember what I mentioned before about the S2MM channel having to be kicked off and running before the device in PL tries to send data to it? Well that's why I have the steps above in the sequence that I do. Steps 2 - 4 configure and kick off the S2MM channel while steps 5 - 7 configure and kick off the MM2S channel.
It's okay to have some other processes happen between steps 4 and 5, but steps 2 - 4 MUST occur before steps 5 - 7. Once step 4 has been executed, the S2MM AXI stream channel will assert its tready signal, at which point your HDL code can start sending it data.
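For reference, the seven steps above can be sketched roughly like this in C. This is not the dma_controller.c attached to the project, just a minimal illustration under a few assumptions: the function and macro names are my own, the buffer addresses are 32-bit, the control registers are written with plain values (which overwrites any other control bits), and it doesn't wait for the reset bit to clear the way real code probably should. It takes a pointer to the register block so it can be pointed at a plain array for a dry run.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Register offsets in bytes, as listed in the steps above. */
#define MM2S_DMACR   0x00u
#define MM2S_SA      0x18u
#define MM2S_LENGTH  0x28u
#define S2MM_DMACR   0x30u
#define S2MM_DA      0x48u
#define S2MM_LENGTH  0x58u

#define DMACR_RUN    (1u << 0)  /* bit 0: start the channel */
#define DMACR_RESET  (1u << 2)  /* bit 2: reset the DMA */

/* Write one 32-bit register at a byte offset from the DMA base. */
static void dma_write(volatile uint32_t *base, uint32_t offset, uint32_t value)
{
    base[offset / 4u] = value;
}

/* Hypothetical helper: arm S2MM first, then kick off MM2S. */
static void dma_loopback_start(volatile uint32_t *base, uint32_t src_addr,
                               uint32_t dst_addr, uint32_t num_bytes)
{
    assert(base != NULL);

    /* Step 1: reset via both control registers. */
    dma_write(base, MM2S_DMACR, DMACR_RESET);
    dma_write(base, S2MM_DMACR, DMACR_RESET);

    /* Steps 2 - 4: the S2MM (receive) side must be armed first. */
    dma_write(base, S2MM_DA, dst_addr);
    dma_write(base, S2MM_DMACR, DMACR_RUN);
    dma_write(base, S2MM_LENGTH, num_bytes);  /* this write arms S2MM */

    /* Steps 5 - 7: now it's safe to start the MM2S (send) side. */
    dma_write(base, MM2S_SA, src_addr);
    dma_write(base, MM2S_DMACR, DMACR_RUN);
    dma_write(base, MM2S_LENGTH, num_bytes);  /* this write starts MM2S */
}
```

On real hardware, `base` would point at the DMA's AXI-Lite base address from your generated platform headers, and the source and destination addresses would be the physical DDR addresses of your buffers.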
I created separate source files for controlling the DMA from bare metal (attached below) so they can be easily imported and reused in any Vitis application project, then the main file implements this demo project. You can find the sequence of steps above implemented in dma_controller.c.
And this also explains something I noticed in the example DMA projects in SDK/Vitis when I first started using DMA. I always thought it was backwards that the example code appeared to attempt to pull data into the DDR (by performing the S2MM - XAXIDMA_DEVICE_TO_DMA transfer first) before anything had been written out from it with the MM2S - XAXIDMA_DMA_TO_DEVICE transfer. However, the S2MM channel has to be ready and waiting to receive data in order to work properly and not lock up.
Anyways, hopefully this brain dump of DMA transfers is helpful. DMA seems to be a tricky method to get started with in FPGA design, but it's super helpful once you figure it out.