The Zynq's FPGA, also known as programmable logic (PL), is a powerful tool for accelerating computationally intensive operations. In order to integrate a custom accelerator into a system based on the Genode OS Framework, we need to transfer data to/from the PL. Since we want to relieve the CPU, this is best done with DMA transactions. In this project, I am going to build a loopback design for the Zybo Z7-20 using Xilinx' AXI DMA IP core, and write a Genode component that programs the IP core.
The Zynq SoC has various AXI interfaces, one of which is the AXI_HP (high performance) interface that provides the highest throughput to the memory interconnect. Another interface is the AXI_ACP that connects to the accelerator coherency port and thereby allows cache-coherent memory access by the PL so that data does not need to be flushed from caches before handing over to the DMA core. As I am interested in the differences with respect to the resulting performance, I am going to create a design with two DMA cores, one connected to AXI_HP and the other to AXI_ACP, and compare the throughput results.
At the very end of this project, I provide instructions to run the resulting test scenario from the source code repository.
Implementing a DMA loopback design with Vivado

For this project, I am using Vivado 2022.2. As always, I created a new project for the Zybo Z7-20 board and added a block design. Next, I added the Zynq processing system (PS) IP to the block design and gave the block automation wizard a spin.
In the next step, I added an axi_dma IP core that I want to connect to the Zynq PS via the AXI_HP interface. For this, I customised the Zynq PS IP and enabled one of the AXI_HP interfaces on the "PS-PL Configuration" page. After adding the axi_dma IP to the block design, I disabled its scatter/gather engine and set the data width of the read and write channel to 64bit (for comparability with the AXI_ACP). Note that, by default, the maximum length of a DMA transfer is limited to 16kB (2^14 bytes). Yet, in the customisation dialogue, one can set the width of the length field to any value from 8 to 26 bits. After closing the dialogue, I connected the AXI stream (AXIS) ports with each other to implement the data loopback. I then ran the connection automation wizard twice to connect the remaining AXI interfaces.
The wizard inserted an axi_mem_intercon block, which I also needed to customise to set the data width to 64bit. I also set the optimisation strategy to "Maximize Performance". To check whether I had missed any obvious configuration, I hit the checkbox button on the toolbar to validate the design.
With this design, we could already transfer data to/from the FPGA. However, we would not receive any interrupts from the DMA IP. To support this, I enabled fabric interrupts by customising the Zynq PS and activating the IRQ_F2P option on the Interrupts page. Since the IRQ_F2P port expects a vectored input, I needed to add an intermediate concat IP block to which I connected the mm2s_introut and s2mm_introut single-bit interrupt signals of the DMA IP as described in detail here.
As I want to compare AXI_HP against AXI_ACP, I added a second axi_dma IP that I customised just as before. I enabled the AXI_ACP interface by customising the Zynq PS IP and also ticked the "Tie off AxUSER" option. When running the connection automation wizard, I only chose S_AXI_ACP and selected M_AXI_MM2S of axi_dma_1 as the master interface. This added a new axi_mem_intercon instance between the Zynq's ACP port and the new DMA core. By running the wizard a second time, I could choose the remaining interfaces and made sure to select S_AXI_ACP as the slave interface for M_AXI_S2MM.
According to the Zynq-7000 TRM, one must set ARUSER/ARCACHE resp. AWUSER/AWCACHE accordingly to enable coherent accesses. The AxUSER signals are taken care of by ticking the corresponding option (see above). For the AxCACHE signals, I added a Constant IP block with width 4 and value 11 that I connected to the arcache signal of the S00_AXI port and the awcache signal of the S01_AXI port. The constant value 11 (binary 1011) sets the bits for bufferable (bit 0), cacheable (bit 1) and write allocate (bit 3).
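Spelled out as C++ constants (my own naming, purely for illustration and not part of any Genode or Xilinx API), the AxCACHE encoding reads as follows:

enum {
	AXCACHE_BUFFERABLE     = 1 << 0,
	AXCACHE_CACHEABLE      = 1 << 1,
	AXCACHE_READ_ALLOCATE  = 1 << 2,
	AXCACHE_WRITE_ALLOCATE = 1 << 3,

	/* bufferable | cacheable | write allocate = 0b1011 = 11 */
	AXCACHE_VALUE = AXCACHE_BUFFERABLE
	              | AXCACHE_CACHEABLE
	              | AXCACHE_WRITE_ALLOCATE
};

static_assert(AXCACHE_VALUE == 11, "matches the Constant IP value");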
In order to connect the interrupt signals of the second DMA core, I increased the number of input ports of the concat block to 4.
In a last configuration step of my design, I increased the frequency of the FCLK_CLK0 clock from the default 50MHz to 250MHz and validated the design.
Lastly, before generating the bitstream, I generated the HDL wrapper by right-clicking on the block design item in the Sources view.
Having generated the bitstream, I exported the hardware via File → Export → Export Hardware and made sure to tick the Include bitstream option. The resulting .xsa file is basically a zip archive containing the bitstream and a few other files. In Genode, we need to let the platform driver know about the devices implemented by the bitstream and their properties (e.g. MMIO addresses, IRQ numbers, clock frequencies). Luckily, most of the required information can be found in the .xsa file. I have therefore written a tool that extracts the necessary details and outputs XML chunks to be integrated into the platform-driver config.
Within a clone of the genode-zynq repository (instructions are available in this project), the tool is executed as follows:
repos/zynq #> ./tool/xpar_from_xsa axi_dma_loopback.xsa
####################################
Instance "axi_dma_0" of type axi_dma
####################################
<device name="axi_dma_0" type="axi_dma">
<io_mem address="0x40400000" size="0x10000"/>
<clock name="fpga0" driver_name="fpga0" rate="250000000"/>
<property name="XPAR_AXI_DMA__SG_INCLUDE_STSCNTRL_STRM" value="0"/>
<property name="XPAR_AXI_DMA__INCLUDE_MM2S" value="1"/>
<property name="XPAR_AXI_DMA__INCLUDE_MM2S_DRE" value="0"/>
<property name="XPAR_AXI_DMA__M_AXI_MM2S_DATA_WIDTH" value="64"/>
<property name="XPAR_AXI_DMA__INCLUDE_S2MM" value="1"/>
<property name="XPAR_AXI_DMA__INCLUDE_S2MM_DRE" value="0"/>
<property name="XPAR_AXI_DMA__M_AXI_S2MM_DATA_WIDTH" value="64"/>
<property name="XPAR_AXI_DMA__INCLUDE_SG" value="0"/>
<property name="XPAR_AXI_DMA__NUM_MM2S_CHANNELS" value="1"/>
<property name="XPAR_AXI_DMA__NUM_S2MM_CHANNELS" value="1"/>
<property name="XPAR_AXI_DMA__MM2S_BURST_SIZE" value="8"/>
<property name="XPAR_AXI_DMA__S2MM_BURST_SIZE" value="16"/>
<property name="XPAR_AXI_DMA__MICRO_DMA" value="0"/>
<property name="XPAR_AXI_DMA__SG_LENGTH_WIDTH" value="14"/>
</device>
####################################
Instance "axi_dma_1" of type axi_dma
####################################
<device name="axi_dma_1" type="axi_dma">
<io_mem address="0x40410000" size="0x10000"/>
<clock name="fpga0" driver_name="fpga0" rate="250000000"/>
<property name="XPAR_AXI_DMA__SG_INCLUDE_STSCNTRL_STRM" value="0"/>
<property name="XPAR_AXI_DMA__INCLUDE_MM2S" value="1"/>
<property name="XPAR_AXI_DMA__INCLUDE_MM2S_DRE" value="0"/>
<property name="XPAR_AXI_DMA__M_AXI_MM2S_DATA_WIDTH" value="64"/>
<property name="XPAR_AXI_DMA__INCLUDE_S2MM" value="1"/>
<property name="XPAR_AXI_DMA__INCLUDE_S2MM_DRE" value="0"/>
<property name="XPAR_AXI_DMA__M_AXI_S2MM_DATA_WIDTH" value="64"/>
<property name="XPAR_AXI_DMA__INCLUDE_SG" value="0"/>
<property name="XPAR_AXI_DMA__NUM_MM2S_CHANNELS" value="1"/>
<property name="XPAR_AXI_DMA__NUM_S2MM_CHANNELS" value="1"/>
<property name="XPAR_AXI_DMA__MM2S_BURST_SIZE" value="8"/>
<property name="XPAR_AXI_DMA__S2MM_BURST_SIZE" value="16"/>
<property name="XPAR_AXI_DMA__MICRO_DMA" value="0"/>
<property name="XPAR_AXI_DMA__SG_LENGTH_WIDTH" value="14"/>
</device>
These XML chunks contain all the details for the platform driver that I was able to extract from the .xsa file. Note that the tool also extracts a set of IP core properties that are needed by Genode's xilinx_axidma library (details below). What's missing, though, are the IRQ numbers, which require background knowledge of the platform and an expert eye on the block design. The fabric interrupts are mapped to IRQs 61 to 68 and 84 to 91. They are assigned in incrementing order starting from the lowest ID. Since I have connected four interrupt signals via the concat block to the fabric interrupts, I know that IRQs 61 to 64 are in use by the DMA cores. I thus added <irq.../> nodes to the XML chunks as follows:
<device name="axi_dma_0" type="axi_dma">
<io_mem address="0x40400000" size="0x10000"/>
<irq number="61"/>
<irq number="62"/>
[...]
</device>
<device name="axi_dma_1" type="axi_dma">
<io_mem address="0x40410000" size="0x10000"/>
<irq number="63"/>
<irq number="64"/>
[...]
</device>
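For reference, the presumable mapping between the concat inputs and the IRQ numbers is spelled out below. This is my own inference from the connection order described above, not information extracted from the .xsa file:

enum {
	IRQ_AXI_DMA_0_MM2S = 61,   /* concat In0 */
	IRQ_AXI_DMA_0_S2MM = 62,   /* concat In1 */
	IRQ_AXI_DMA_1_MM2S = 63,   /* concat In2 */
	IRQ_AXI_DMA_1_S2MM = 64    /* concat In3 */
};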
The resulting XML chunks can be used for configuring the platform driver. In a previous project, I have introduced a subsystem that takes care of bitstream loading and platform-driver configuration. This subsystem requires a devices_manager.config ROM that specifies the available bitstreams and their devices. Creating the devices_manager.config is best done via a depot archive (Genode's software packaging system), which I describe in the next section.
Packaging the bitstream for Genode

For bitstream packaging, I added support for re-creating Vivado projects to Goa in a previous project. Goa is a command-line-based workflow tool for the development of applications for the Genode OS Framework. With Goa, bitstream packaging involves four preparatory steps:
- Creating a new Goa project.
- Exporting the Vivado project in the form of a .tcl file.
- Copying the required source files.
- Creating a devices template.
I added a new Goa project to my goa-projects repository by creating a new sub-directory zynq/zybo_z720_dma_loopback-bitstream/ and populating the directory with a version file, a LICENSE file and an artifacts file. The version file specifies the archive version in yyyy-mm-dd format, whereas the artifacts file contains a list of files to be packaged (one file per line). Knowing that the bitstream will be named after the project/directory name and that I want to generate a devices_manager.config file, my artifacts file has the following content:
zybo_z720_dma_loopback-bitstream.bit
devices_manager.config
In order to export a .tcl file from Vivado, I typed the following command into Vivado's tcl console:
write_project_tcl -paths_relative_to /home/johannes/axi_dma_loopback /path/to/goa-projects/zynq/zybo_z720_dma_loopback-bitstream/src/vivado.tcl
Note that I directly exported the vivado.tcl file into the src sub-directory of my Goa project. At the very top of the file (after the obligatory comment block), the function checkRequiredFiles is defined as follows:
proc checkRequiredFiles { origin_dir} {
	set status true
	return $status
}
Since I haven't added any source files or custom IP cores to the hardware design, there are no required source files. Note that you can find instructions on how to copy required source files to the Goa project in my previous project.
Last, I created a src/devices file in order to specify what devices Goa shall consider when generating the devices_manager.config. The easiest way to create the devices file is to execute the aforementioned xpar_from_xsa tool on the .xsa file as follows:
repos/zynq #> ./tool/xpar_from_xsa axi_dma_loopback.xsa | sed 's/value=".*"//' | sed 's/type=".*"//' | grep -v clock | grep -v io_mem
This command removes the value attributes from the property nodes and the type attribute from the device nodes. Moreover, it removes the clock and io_mem nodes. Adding the irq nodes as explained in the previous section and wrapping everything in a devices node, I ended up with the following content that I put into the src/devices file:
<devices>
<device name="axi_dma_0" >
<irq number="61"/>
<irq number="62"/>
<property name="XPAR_AXI_DMA__SG_INCLUDE_STSCNTRL_STRM" />
<property name="XPAR_AXI_DMA__INCLUDE_MM2S" />
<property name="XPAR_AXI_DMA__INCLUDE_MM2S_DRE" />
<property name="XPAR_AXI_DMA__M_AXI_MM2S_DATA_WIDTH" />
<property name="XPAR_AXI_DMA__INCLUDE_S2MM" />
<property name="XPAR_AXI_DMA__INCLUDE_S2MM_DRE" />
<property name="XPAR_AXI_DMA__M_AXI_S2MM_DATA_WIDTH" />
<property name="XPAR_AXI_DMA__INCLUDE_SG" />
<property name="XPAR_AXI_DMA__NUM_MM2S_CHANNELS" />
<property name="XPAR_AXI_DMA__NUM_S2MM_CHANNELS" />
<property name="XPAR_AXI_DMA__MM2S_BURST_SIZE" />
<property name="XPAR_AXI_DMA__S2MM_BURST_SIZE" />
<property name="XPAR_AXI_DMA__MICRO_DMA" />
<property name="XPAR_AXI_DMA__SG_LENGTH_WIDTH" />
</device>
<device name="axi_dma_1" >
<irq number="63"/>
<irq number="64"/>
<property name="XPAR_AXI_DMA__SG_INCLUDE_STSCNTRL_STRM" />
<property name="XPAR_AXI_DMA__INCLUDE_MM2S" />
<property name="XPAR_AXI_DMA__INCLUDE_MM2S_DRE" />
<property name="XPAR_AXI_DMA__M_AXI_MM2S_DATA_WIDTH" />
<property name="XPAR_AXI_DMA__INCLUDE_S2MM" />
<property name="XPAR_AXI_DMA__INCLUDE_S2MM_DRE" />
<property name="XPAR_AXI_DMA__M_AXI_S2MM_DATA_WIDTH" />
<property name="XPAR_AXI_DMA__INCLUDE_SG" />
<property name="XPAR_AXI_DMA__NUM_MM2S_CHANNELS" />
<property name="XPAR_AXI_DMA__NUM_S2MM_CHANNELS" />
<property name="XPAR_AXI_DMA__MM2S_BURST_SIZE" />
<property name="XPAR_AXI_DMA__S2MM_BURST_SIZE" />
<property name="XPAR_AXI_DMA__MICRO_DMA" />
<property name="XPAR_AXI_DMA__SG_LENGTH_WIDTH" />
</device>
</devices>
Goa is going to fill in the blanks, i.e. the I/O memory address, the clock frequency and the values of the stated properties, with the information that it finds in the .xsa file after re-creating the Vivado project and re-generating the bitstream.
With these preparatory steps, I was able to build and package the bitstream:
goa-projects/zynq/zybo_z720_dma_loopback-bitstream #> goa export --depot-user jschlatow
[...]
[zybo_z720_dma_loopback-bitstream] exported /home/johannes/repos/genode/depot/jschlatow/src/zybo_z720_dma_loopback-bitstream/2023-01-16
[zybo_z720_dma_loopback-bitstream] exported /home/johannes/repos/genode/depot/jschlatow/bin/arm_v7a/zybo_z720_dma_loopback-bitstream/2023-01-16
Note that I have configured Goa to export the archives directly into my Genode depot folder so that I can use the archives right away without publishing them (see goa help config for more details).
In order to test the DMA IP cores and compare their throughput results, I have implemented a generic Genode component. The resulting code can be found in the genode-zynq repository. In the scope of this story, I will run through a simplified implementation that demonstrates how to initiate DMA transfers to the axi_dma IP core in Genode.
I started by creating a new directory for the component in my zynq repo:
repos/zynq #> mkdir -p src/test/dma_loopback
Within this directory, I created a target.mk file with this content:
REQUIRES := arm_v7a
TARGET = test-dma_loopback
LIBS = base xilinx_axidma libc
SRC_CC += main.cc
This tells Genode's build system the name of the binary, the used libraries and the source files to compile. I also added a REQUIRES := arm_v7a line that instructs the build system to never try building the component for architectures other than arm_v7a. The list of libraries is explained as follows: The base library is always required as it provides Genode's base API. In order to simplify driver development for Xilinx IP cores, I have added Xilinx' embeddedsw repository as a port to Genode and implemented the xilinx_axidma library based on this port to take care of device initialisation and configuration of axi_dma IP cores. This library, e.g., reads the XPAR_* property values from the device config to set up the device. Normally, these values are found hard-coded in an xparameters.h file. Since the Xilinx embeddedsw code uses a few headers that are found in Genode's C runtime, I also had to add libc to the list.
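To illustrate the mechanism, the following sketch shows how such a property lookup could be implemented on top of Genode's platform session. The helper is hypothetical (the actual xilinx_axidma code differs), and the with_xml() accessor may vary between Genode versions:

/* Genode includes */
#include <platform_session/connection.h>
#include <util/string.h>
#include <util/xml_node.h>

/* hypothetical helper: find a named property in the device info
 * that the platform driver provides to this component */
static unsigned xpar_value(Platform::Connection &platform, char const *prop)
{
	unsigned value = 0;

	platform.with_xml([&] (Genode::Xml_node const &devices) {
		devices.for_each_sub_node("device", [&] (Genode::Xml_node const &device) {
			device.for_each_sub_node("property", [&] (Genode::Xml_node const &property) {

				using Name = Genode::String<64>;
				Name name = property.attribute_value("name", Name());

				if (Genode::strcmp(name.string(), prop) == 0)
					value = property.attribute_value("value", 0u);
			});
		});
	});

	return value;
}

With something like this in place, the library can, for instance, determine the configured data width via xpar_value(platform, "XPAR_AXI_DMA__M_AXI_MM2S_DATA_WIDTH").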
At this point, I can explain how the xpar_from_xsa tool decides which XPAR_* parameters to extract from the .xsa file. It simply looks for any XPAR_* strings referenced by the code in the genode-zynq repository and extracts the corresponding values from the .xsa file.
Having set up the build system, I continued with implementing the main.cc file. At the top of the file, I started with the following includes:
/* Genode includes */
#include <libc/component.h>
#include <timer_session/connection.h>
/* Xilinx port includes */
#include <xilinx_axidma.h>
/* local includes */
#include <dma_ring_buffer.h>
using namespace Genode;
/* [...] see below */
Since I added the libc library as a dependency, I need to write a libc component and include the corresponding header. Moreover, I know that I need a timer connection. I also added an include directive for the xilinx_axidma library and a locally implemented ring of DMA buffers.
Let's continue with the implementation of the Main object and its members:
/* [...] see above */
struct Main
{
	enum {
		DMA_BUFFER_SIZE = 1024*1024,
		ACCESS_SIZE     = 64*30
	};

	Env &env;

	Xilinx::Axidma axidma { env, Xilinx::Axidma::Mode::NORMAL };

	Xilinx::Axidma::Transfer_complete_handler<Main> rx_handler {
		*this, &Main::handle_rx_complete };

	/* timer for throughput reporting */
	using Periodic_timeout = Timer::Periodic_timeout<Main>;
	Timer::Connection timer { env };
	Constructible<Periodic_timeout> timeout { };

	/* state for throughput test */
	unsigned last_counter { 0 };
	unsigned counter      { 0 };
	unsigned rx_counter   { 0 };
	unsigned iterations   { 5 };

	Dma_ring_buffer buffers { axidma.platform(), DMA_BUFFER_SIZE, UNCACHED };

	/* [...] see below */
I don't want to go into every detail, since most of these are self-explanatory or become clear when looking at their use. The Xilinx::Axidma class is provided by the xilinx_axidma library. In addition to the obligatory Env object, it takes a mode as parameter. The class creates a connection to a Platform service and acquires an axi_dma device. The mode determines whether to operate without interrupts (SIMPLE), with interrupts (NORMAL) or in scatter/gather mode (SG). Note that scatter/gather mode is not yet implemented.
The Transfer_complete_handler allows us to register Main's handle_rx_complete method as an IRQ handler.
In this example, I am initialising the Dma_ring_buffer with the UNCACHED attribute. When using the axi_dma core that is connected via the ACP, I could replace this with CACHED. The Dma_ring_buffer manages a circular set of DMA buffer pairs (TX and RX). This allows performing read/write operations on DMA buffers while another buffer pair is assigned to the current DMA transaction.
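To convey the idea, here is a condensed sketch of the ring buffer's bookkeeping. The ring size of three and the member names are my own assumptions; the actual dma_ring_buffer.h in the genode-zynq repository differs in detail:

struct Dma_ring_buffer
{
	struct Dma_buffer_pair
	{
		Platform::Dma_buffer &tx;
		Platform::Dma_buffer &rx;
	};

	enum { NUM = 3 };   /* assumed ring size */

	/* allocation of NUM TX/RX buffer pairs omitted for brevity */

	unsigned _head { 0 };   /* next pair to be filled by software   */
	unsigned _tail { 0 };   /* pair owned by the in-flight transfer */

	bool empty() const { return _head == _tail; }

	Dma_buffer_pair head();   /* pair at _head */
	Dma_buffer_pair tail();   /* pair at _tail */

	/* advance head to the next pair; returns false if the ring is
	 * full, i.e. if head would catch up with tail */
	bool advance_head()
	{
		if ((_head + 1) % NUM == _tail)
			return false;

		_head = (_head + 1) % NUM;
		return true;
	}

	/* release the tail pair once its transfer has completed */
	void advance_tail()
	{
		if (!empty())
			_tail = (_tail + 1) % NUM;
	}
};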
Next, let's have a look at Main's methods:
	/* [...] see above */

	/* simple transfer test */
	void test_simple_transfer(size_t, uint8_t);

	/* methods for throughput test */
	void handle_rx_complete();
	void fill_transfers();
	void queue_next_transfer();

	Main(Env &env) : env(env)
	{
		test_simple_transfer(8192, 0x21);

		/* prepare throughput test */
		axidma.rx_complete_handler(rx_handler);
		fill_transfers();

		/* start periodic timer for throughput logging */
		timeout.construct(timer, *this, &Main::handle_timeout,
		                  Microseconds { 1000 * 2000U });

		/* start throughput test */
		queue_next_transfer();
	}

	void handle_timeout(Duration)
	{
		unsigned long transmitted = counter - last_counter;
		last_counter = counter;

		log("Current loopback throughput: ",
		    ((transmitted * DMA_BUFFER_SIZE) / 2000000UL), "MB/s");

		iterations--;
		if (iterations == 0)
			env.parent().exit(0);
	}
};

/* [...] see below */
I added the method test_simple_transfer() to initiate a single, blocking loopback transfer. For measuring the throughput, I added the aforementioned IRQ handler and two methods for filling the DMA buffers and for queueing a DMA transfer.
In Main's constructor, I first call test_simple_transfer() to test whether simple DMA transfers work. Next, I initiate the throughput test by registering the IRQ handler, filling the DMA buffers, registering a periodic timeout for throughput logging and starting the test by queueing the first transfer. The timeout triggers the handle_timeout() method every two seconds. The method calculates the average throughput in the past logging interval and exits the component after five iterations.
I implemented test_simple_transfer() as follows:
/* [...] see above */

void Main::test_simple_transfer(size_t size, uint8_t value)
{
	Platform::Dma_buffer src_buffer { axidma.platform(), size, UNCACHED };
	Platform::Dma_buffer dst_buffer { axidma.platform(), size, UNCACHED };

	/* initialise src buffer */
	Genode::memset(src_buffer.local_addr<void>(), value, size);
	Genode::memset(dst_buffer.local_addr<void>(), value != 0 ? 0 : -1, size);

	log("initiating simple transfer of size ", (unsigned)size);

	/* perform DMA transfer */
	if (axidma.simple_transfer(src_buffer, size, dst_buffer, size)
	    != Xilinx::Axidma::Result::OKAY) {
		error("DMA transfer failed");
		env.parent().exit(1);
	}

	/* compare buffers */
	if (Genode::memcmp(src_buffer.local_addr<void>(),
	                   dst_buffer.local_addr<void>(), size)) {
		error("DMA transfer failed - Data error");
		env.parent().exit(1);
	} else
		log("DMA transfer succeeded");
}

/* [...] see below */
The method allocates two DMA buffers of the provided size, fills the first one with the provided value and the second one with a different value. After that, it initiates a loopback transfer by calling axidma.simple_transfer(). This method blocks until both transfers have completed. If any error occurred, the component exits. Otherwise, the content of both buffers is compared. If they are not equal, the component exits as well.
Now, let's continue with the throughput test implementation:
/* [...] see above */

void Main::fill_transfers()
{
	/* fill all buffers */
	while (true) {
		Genode::memset(buffers.head().tx.local_addr<void>(),
		               (uint8_t)counter, ACCESS_SIZE);
		*buffers.head().tx.local_addr<unsigned>() = counter;

		if (buffers.advance_head())
			counter++;
		else
			break;
	}
}

void Main::queue_next_transfer()
{
	/* start transfer */
	if (buffers.empty()) {
		warning("unable to queue transfer from empty ring buffer");
		return;
	}

	Dma_ring_buffer::Dma_buffer_pair bufs = buffers.tail();

	if (axidma.start_rx_transfer(bufs.rx, DMA_BUFFER_SIZE)
	    != Xilinx::Axidma::Result::OKAY) {
		error("DMA rx transfer failed");
		env.parent().exit(1);
	}

	if (axidma.start_tx_transfer(bufs.tx, DMA_BUFFER_SIZE)
	    != Xilinx::Axidma::Result::OKAY) {
		error("DMA tx transfer failed");
		env.parent().exit(1);
	}
}

/* [...] see below */
In the fill_transfers() method, I initialise the first ACCESS_SIZE bytes of the TX buffer and make sure to store a counter value in the first word of the buffer. The counter value is increased whenever we could successfully advance the head of the ring buffer to the next buffer pair.
The queue_next_transfer() method takes the tail of the ring buffer and initiates a separate DMA transfer for the RX and the TX buffer. In contrast to the simple_transfer() method, the start_rx_transfer()/start_tx_transfer() methods do not block until the transfer has completed. Instead, completion is indicated by the RX-complete interrupt, which is handled by the following method:
/* [...] see above */

void Main::handle_rx_complete()
{
	if (!axidma.rx_transfer_complete())
		return;

	/* compare the first ACCESS_SIZE bytes of src and dst buffers */
	Dma_ring_buffer::Dma_buffer_pair bufs = buffers.tail();
	if (Genode::memcmp(bufs.tx.local_addr<void>(),
	                   bufs.rx.local_addr<void>(), ACCESS_SIZE)) {
		error("DMA failed - Data error");
		env.parent().exit(1);
	}

	/* check whether memory content has the expected value */
	else if (*bufs.tx.local_addr<unsigned>() != rx_counter++) {
		error("Expected ", rx_counter-1,
		      " but got ", *bufs.tx.local_addr<unsigned>());
		env.parent().exit(1);
	}

	/* advance tail and initiate next transfer */
	buffers.advance_tail();
	queue_next_transfer();
	fill_transfers();
}

/* [...] see below */
In the handle_rx_complete() method, I compare the first ACCESS_SIZE bytes of the TX and RX buffers. Furthermore, I check whether the first word has the expected counter value. If everything is as expected, the tail of the ring buffer is advanced, the next transfer is queued, and the old tail buffer is written with new data.
Last but not least, the entrypoint for libc components must be defined. I did so by instantiating the Main object, which also completes the main.cc:
/* [...] see above */
void Libc::Component::construct(Env &env) {
static Main main(env);
}
With the implementation at hand, I could give the build system a spin and try compiling the component.
genode #> make -C build/arm_v7a test/dma_loopback
[...]
/home/johannes/repos/genode/contrib/xilinx_embeddedsw-ceeef20e4f3af9c01c24f44cb45a18e5a5defe1a/src/embeddedsw/lib/bsp/standalone/src/common/xil_printf.h:16:10: fatal error: xparameters.h: No such file or directory
16 | #include "xparameters.h"
| ^~~~~~~~~~~~~~~
compilation terminated.
make[2]: *** [/home/johannes/repos/genode/repos/base/mk/generic.mk:58: xaxidma.o] Error 1
make[1]: *** [var/libdeps:556: test-dma_loopback.prg] Error 2
make: *** [Makefile:336: gen_deps_and_build_targets] Error 2
The embeddedsw port is looking for an xparameters.h file. Typically, this file contains the definitions of all the XPAR_* parameters. Since I introduced a dynamic configuration mechanism in the xilinx_axidma library, I tried creating an empty xparameters.h file in src/test/dma_loopback/, which did the trick.
In order to execute the component in a test scenario, I further created a run/dma_loopback_test.run file (see attachment). In this file, I import a bunch of depot archives and build my test component. In the config for the top-level init component that defines the composition of the scenario, I instantiate a timer component, a vfs component, a platform-driver subsystem, and the test component itself. The vfs serves the bitstream file and a static config to the fpga driver running in the platform-driver subsystem. The latter is provided by the pkg/driver_fpga-zynq archive, which includes a drivers.config file. Instantiating the subsystem consists of starting an init component that takes drivers.config as its configuration.

The subsystem also requires a devices ROM, a devices_manager.config, and a policy ROM. The devices ROM is provided by the raw/[board]-devices archive ('[board]' is replaced by the build system with the target board). As mentioned when packaging the bitstream, the devices_manager.config is provided by the zybo_z720_dma_loopback-bitstream archive. The policy ROM, however, is specific to the scenario. I thus let the run script create it by writing the following content to a [run_dir]/genode/policy file:
<config>
<policy label="test-dma_loopback -> ">
<device name="axi_dma_0"/>
</policy>
</config>
This instructs the platform driver to provide the device named "axi_dma_0" to the test component.
With all these ingredients, I was able to run the test scenario successfully:
genode #> make -C build/arm_v7a run/dma_loopback_test BOARD=zynq_zybo_z7
...
[init] parent provides
[init] service "ROM"
[init] service "PD"
[init] service "CPU"
[init] service "LOG"
[init] service "IO_MEM"
[init] service "IRQ"
[init] child "timer"
[init] RAM quota: 872K
[init] cap quota: 166
[init] ELF binary: timer
[init] priority: 0
[init] provides service Timer
[init] child "vfs"
[init] RAM quota: 8040K
[init] cap quota: 166
[init] ELF binary: vfs
[init] priority: 0
[init] provides service File_system
[init] child "platform_drv"
[init] RAM quota: 24424K
[init] cap quota: 966
[init] ELF binary: init
[init] priority: 0
[init] provides service Platform
[init] child "test-dma_loopback"
[init] RAM quota: 8040K
[init] cap quota: 166
[init] ELF binary: test-dma_loopback
[init] priority: 0
[init] child "timer" announces service "Timer"
[init] child "vfs" announces service "File_system"
[init -> platform_drv -> fpga_drv] Loading bitstream zybo_z720_dma_loopback-bitstream.bit of size 0x3dbafc
[init -> test-dma_loopback] initiating simple transfer of size 8192
[init -> test-dma_loopback] DMA transfer succeeded
[init -> test-dma_loopback] Current loopback throughput: 2MB/s
[init -> test-dma_loopback] Current loopback throughput: 725MB/s
[init -> test-dma_loopback] Current loopback throughput: 727MB/s
[init -> test-dma_loopback] Current loopback throughput: 727MB/s
[init -> test-dma_loopback] Current loopback throughput: 727MB/s
[init] child "test-dma_loopback" exited with exit value 0
Expect: 'interact' received 'strg+c' and was cancelled
Running the test scenario from the repository

In the genode-zynq repository, you will find an extended version of the presented test scenario. It acquires throughput results for different buffer and access sizes and runs the test component for both DMA IP cores in the hardware design. With the following steps, you are able to reproduce this scenario.
I'm assuming you have already cloned the main genode repository and the genode-zynq repository (see Genode 101: Getting started with the Zybo Z7). This also implies that you have created a build directory for arm_v7a.
As a prerequisite, you need to download the bitstream archive. This is achieved by the following command that must be run from within the genode worktree.
genode #> ./tool/depot/download jschlatow/bin/arm_v7a/zybo_z720_dma_loopback-bitstream/2023-01-16
Now, you are able to run the test scenario with this command:
genode #> make -C build/arm_v7a run/dma_loopback_test BOARD=zynq_zybo_z7
This scenario runs a bunch of throughput tests and prints the results. I compiled the output of my test run into a table that shows the throughput in MBytes/s for different buffer and access sizes (access size in bytes):
Buffer size  |    8KB    |   32KB    |   128KB   |   512KB   |    2MB     |
Access size  | HP  | ACP | HP  | ACP | HP  | ACP | HP  | ACP |  HP  | ACP |
-------------+-----+-----+-----+-----+-----+-----+-----+-----+------+-----+
          64 |  81 |  92 | 323 | 368 | 582 | 429 | 868 | 516 | 1000 | 573 |
         256 |  59 |  90 | 238 | 362 | 503 | 427 | 828 | 515 |  987 | 575 |
        1024 |  28 |  84 | 115 | 338 | 323 | 418 | 673 | 511 |  924 | 573 |
        4096 |   9 |  68 |  37 | 272 | 132 | 389 | 385 | 498 |  736 | 569 |
       16384 |   - |   - |  10 | 155 |  39 | 309 | 142 | 454 |  404 | 552 |
       65536 |   - |   - |   - |   - |  10 | 173 |  40 | 347 |  144 | 504 |
Looking at the results, one can see that the ACP is faster than the HP interface for smaller DMA buffers and larger access sizes. However, the HP interface outperforms the ACP when moving quite large buffers of which only a very small fraction has recently been accessed by software.
Note that I haven't used the scatter/gather engine of the axi_dma IP core in this test yet. Currently, the DMA core signals the PS whenever a transaction completes and must wait for the software to program a new transaction. This limits the throughput, especially for small buffer sizes. Since the scatter/gather engine supports queueing of multiple DMA transfers, I expect it to improve the throughput results for smaller buffers.