This document is a thorough tutorial on how to implement a DMA controller with Xilinx IP. My idea was to write a comprehensive guide with all the Do’s and Don’ts related to the implementation of a DMA system with Xilinx IPs.
I’ve spent a lot of time searching for such a tutorial but could not find any useful ones that fit my needs. I wanted to design a quick DMA controller to check the bandwidth of Xilinx PCIe, preferably for free, which would work flawlessly.
Eventually I decided to do it on my own, and after successfully designing it with very impressive throughput results, I thought it would be a good idea to publish an article.
I did my best to stay away from the usual RTFM material and focused on the tips and tricks that helped me implement a fully working DMA system with Xilinx.
This guide is divided into 3 parts:
- Part 1 is a general explanation of the notion of DMA. I’ve put in some links to DMA tutorials (since there are many), but also interesting facts related to DMA with a Scatter-Gather descriptor table.
- Part 2 is dedicated to the Xilinx XDMA. This is the main core in my project, and I’ve explained all the tabs (almost all options, except advanced features which I did not use).
- Part 3 goes over all the main blocks in my project. This is an extensive overview of all the blocks I’ve used, and I think it will give a nice starting point to whoever wants to implement DMA with the Xilinx XDMA core.
A DMA (Direct Memory Access) engine is a key element in achieving high bandwidth utilization for PCI Express applications. It frees up CPU resources from data streaming and helps to improve overall system performance. In a typical system with a PCIe architecture, PCIe Endpoints often contain a DMA engine. This engine is controlled by the system host to transfer data between system memory and the Endpoint.
I will not go over the DMA concept and methodology (Wikipedia can help with that). I will briefly explain the 2 main types of DMA transfers:
1. “Common-buffer DMA” (“continuous DMA”)
2. “Scatter/gather DMA”
Common-buffer DMA is based on a single buffer. The base physical pointer of the buffer is passed to the hardware and the data transmission starts from that point. Pretty straightforward indeed.
The main disadvantage is that it is one single buffer. If it needs to be rather big (and big can be a few MBs) there could be a problem allocating it, and there’s the issue of ownership. Who owns it? Is it the hardware (driver) or the host (PCI root port)?
Scatter-gather solves the problem of the contiguous buffer by allocating many small chunks, each with its own address and size. The result is called a descriptor table: a table of address and length pairs which prepares the engine for all the transfers.
The DMA engine incurs very little overhead when switching between buffer locations, because no software intervention is necessary; it is all in hardware logic.
The buffer created is not physically contiguous. This is a virtual-to-physical mapping: the hardware knows it, the driver knows it, but the software does not; it sees a fully contiguous buffer. This is the simple philosophy behind it. The physical memory is divided into discrete units called pages. Page size varies from one architecture to the next, although most systems currently use 4096-byte pages. These physically non-contiguous pages can be assembled into a single, contiguous array from the device’s point of view.
This table of addresses and sizes must be saved and handled somewhere (in some physical storage, like DDR or BRAM, for example) prior to the DMA transaction, and the ownership comes for free, as each chunk is owned by the hardware (driver). Obviously, the physical pointer to each small buffer needs to be passed to the hardware DMA mechanism, and someone needs to handle these pointers and decide when to use each one. The driver can do it, or the FPGA can handle the passing of descriptors. In Xilinx language this is referred to as “Descriptor Bypass”. I’ll explain it later in more detail.
As pointed out, even when allocating 10MB as a contiguous space, there could be thousands of descriptors allocated for this task, each covering 4096 bytes (the page size). Obviously, a small number of descriptors, hence a smaller table, is much preferred, since the table can then be stored in the FPGA internal RAM (BRAM), and thus we gain lower latency, a less error-prone mechanism, etc.
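The arithmetic above can be sketched in host-side C. Note this is an illustrative sketch only: the struct below is a simplified address/length pair, not the exact XDMA descriptor layout (see PG195 for the real format).

```c
#include <stdint.h>

/* Illustrative sketch only: a scatter-gather entry is essentially
 * an address/length pair (PG195 defines the real XDMA layout). */
struct sg_descriptor {
    uint64_t addr;  /* physical address of this chunk */
    uint32_t len;   /* chunk length in bytes */
};

/* Number of page-sized descriptors needed to cover a buffer. */
static uint32_t num_descriptors(uint64_t buf_bytes, uint32_t page_size)
{
    /* round up: a partial page still needs its own descriptor */
    return (uint32_t)((buf_bytes + page_size - 1) / page_size);
}
```

For a 10MB buffer and 4096-byte pages this gives 2560 descriptors, which is why fewer, larger chunks shrink the table enough to fit in BRAM.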
When the driver maps the non-contiguous physical memory to contiguous memory, we call it virtual memory. The driver can map the buffers prior to operating system start-up, so the chances of allocating large buffers are higher (but not guaranteed); in this case the mapping is done in kernel space. If the driver maps the buffer after the OS has started, there will be many more pages to allocate, as the PCIe peripherals have already taken the big memory chunks; in this case the mapping is done in user space.
In FPGA applications which involve PCIe and DMA, the DMA usually refers to moving data between the host memory and the internal memory that resides in the FPGA. Moving data from the host to the FPGA is referred to as the Host to Card (H2C) direction, which is a DMA read channel, while moving data from the FPGA to the host is the Card to Host (C2H) direction, which is a DMA write channel.
In the following figure I’ve tried to explain the mechanism. Each descriptor points to a different size of allocated buffer in the system memory (multiples of 4096 bytes). The driver is in charge of these allocations, and once the Scatter-Gather table is set, the FPGA/driver manages this table and passes it to the DMA engine.
For this tutorial, I decided to go with the H2C direction, based on a scatter-gather driver configuration. I’ve chosen the FPGA to manage the table; the ‘ownership’ I referred to earlier. The driver oversees the allocation of the memory, but the FPGA knows when to pass a new descriptor from the table. This is the basic idea of the DMA controller.
Hardware
For my project I chose to work with the Xilinx KCU105 UltraScale evaluation board with a PCIe x8 lane width. This board consists of a Kintex UltraScale XCKU040-2FFVA1156E device and a lot of other goodies to play with.
PCIe + DMA solutions
Clicking on the ‘+’ icon in the Vivado block design (BD) and searching for 'PCI' brings up these options:
There are various solutions the user can choose from. To start, we’ll need the Xilinx AXI Bridge for PCI Express. This is the basic building block which enables the PCIe interface:
Still, this block does not include any DMA implementation, so the user must wrap his own DMA mechanism on top of it, meaning a DMA mechanism must be written from scratch and wrapped over this block. There are various solutions for such a wrapper, and for different DMA controller methodologies as well. I will list several applicable solutions here:
The links above are shown here: North East Logic DMA block, XAPP 1052, XAPP 1171, XAPP 1289, Microblaze forum link
After going over the various options, eventually I decided to go with option 7; if it’s free and it gives all these wonderful features, why not use it? There are some drawbacks, I will mention them later.
Regarding PG195: PG195 is a very informative manual (about 150 pages). I’ve read it a few times, just to get the hang of it. It has many simulations and example designs (chapter 6) which were useful in the early stage of my design. I must say that going over this chapter made me feel a bit dizzy, as it is not very clear what to do in each simulation. Eventually I succeeded in designing the basic PCIe simulation with my DMA control mechanism, but it was not so easy to set up.
Driver
Xilinx has its own driver, with a very informative manual (AR65444). The AR is a straightforward manual with all the code (C language) needed to set up the driver with a DMA test (H2C and C2H). It has a readme file which explains to the user exactly what to do and how to compile the driver. It gives a good starting point for understanding the DMA concept, as it exercises the core registers extensively, and I used it successfully.
You can choose between working with the Windows driver and the Linux driver. The concept is the same.
AR71435 is a very thorough tutorial on the interaction between the driver and the XDMA. It can give additional information on how to use the driver. I suggest reading it and running the various tests implemented there.
Jason Lawley, a Xilinx expert on PCIe applications, has a great tutorial on getting the best performance from Xilinx’s DMA engine. I strongly urge anyone who plans to design a DMA controller to first go over this tutorial, together with AR65444.
So, after this long introduction, I think it’s time to move on to the real deal.
Requirements
To complete this hands-on tutorial, you will need the following:
- Vivado 2018.2 – the screenshots are taken from this version. Obviously, a higher version will also work, but the section numbers I’ve written here may differ.
- KCU105 Evaluation Board
- Xilinx PCIe Driver
In the following part 2 of my tutorial I will dive deeper into the implementation. I'll start with the block diagram of my design. The image below gives a high-level view of the design including all main blocks and how they connect to the XDMA main IP Core.
The block diagram in the figure above is the full design of a basic PCIe + DMA with Scatter/Gather mode and Descriptor Bypass feature enabled (one channel).
I will now go over the main blocks and give tips, tricks and insights on how to do it right and gain remarkably high throughput results.
XDMA - Block 1
After importing the “DMA/Bridge Subsystem for PCIe” block into the Vivado BD, double-click on it to open the IP customization window, shown in the first screenshot. I will now go over all the tabs and explain most of the options and the optimal value to insert in each.
After configuring the PCI link speed (8GT/s) and the lane width (x8), the AXI Data Width changes automatically to 256 bits and the AXI Clock Frequency to 250MHz.
Next, I selected:
a. The AXI Stream checkbox, as I wanted a continuous DMA data flow.
b. The AXI-Lite Slave interface checkbox, which enables me to program the DMA core internal registers, whether for performance counters, reading the number of descriptors, etc.
Regarding the 4 checkboxes at the bottom:
PIPE generates a core that can be simulated with PIPE interfaces connected (I’ll explain this later on).
The DRP (Dynamic Reconfiguration Port) checkbox allows dynamic changes to the parameters of the transceivers and common primitives. It has a processor-friendly interface with an address bus, data bus and control signals.
The Additional Transceiver Control and Status Ports checkbox adds transceiver debug ports. The IP advises changing them in accordance with the GT user guide.
I’ve left all of these unchecked (the default state).
PCIe ID Tab
This tab holds info on the PCIe endpoint (the Xilinx FPGA). The user can change all the fields. Obviously, since the driver communicates with the PCIe endpoint, the device ID (at least) must be identical to the device ID used in the driver code.
The Vendor ID identifies the device vendor. The PCI-SIG website has a list of the various vendor IDs (here), and the driver translates the vendor ID accordingly. The default value of this field is 10EE, and looking at the website we can see there’s a match:
The user can insert any value, but as good engineering practice it is better to use a known value. It only affects how the driver translates this number to a specific vendor. For example, when inserting the value 0x1172, the driver will identify the PCIe endpoint as Intel (Altera), whereas the value 0x10EE will be identified as Xilinx. Other than that, there is no implication for the ongoing work.
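To illustrate the translation just described, here is a hypothetical helper of the kind a driver or an lspci-style tool might use; only the two IDs mentioned above are listed.

```c
#include <stdint.h>

/* Hypothetical helper: map a PCI Vendor ID to a vendor name.
 * Only the two IDs discussed in the text are listed here. */
static const char *vendor_name(uint16_t vendor_id)
{
    switch (vendor_id) {
    case 0x10EE: return "Xilinx";
    case 0x1172: return "Intel (Altera)";
    default:     return "unknown";
    }
}
```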
A great tool which helped me a lot during my debug phase with Linux was the devmem2 tool. It enables the user to read/write specific locations in memory without any dedicated program (running it from an SSH terminal). This is a great advantage of using Linux over Windows, since you have access to memory locations which Windows users are not allowed to handle.
PCIe: BARs Tab
The next tab relates to the various PCIe BARs. The default values and checkboxes are as follows:
BARs (Base Address Registers) are used to define the memory addresses and I/O port addresses (memory-mapped I/O) used by PCI devices. They define the memory space and start address of a contiguously mapped region in system memory or I/O space.
The endpoint (our FPGA in this case) requests a contiguous region of memory of a given size, which is then mapped by the host memory manager; the base address assigned to it is programmed into the endpoint’s BAR0/1/2 field in the endpoint’s PCI configuration space.
Xilinx has a great explanation about BARs in AR65062.
This whole process is carried out at the lower levels of PCIe, BIOS, driver, etc., so the common user need not intervene. Nonetheless, as these BARs have implications for our design (see the next paragraph), the user should decide what to define in these fields.
There are various checkboxes available. The user manual defines them well.
Master AXI Lite: This interface relates to the memory-mapped register interface. According to the PG195 manual:
I’ve checked this checkbox, as I wanted a memory-mapped register interface available; in other words, for a register interface to the outside world this checkbox must be checked.
Master AXI Bypass: Same as Master AXI Lite, but full AXI. I did not check it, as AXI Lite was enough for my implementation.
Slave AXI Lite: The DMA/Bridge Subsystem for PCIe registers (i.e., the internal registers of the core) can be accessed from the host or from the AXI-Lite Slave interface. These registers should be used for programming the DMA and checking status. I’ve checked this checkbox.
Moving on, the next fields are ‘Size’, ‘Value’ and ‘PCI to AXI Translation’; the ‘Size’ and ‘Value’ fields control the space allocated for this BAR. For example, looking at my configuration below, you can see I’ve allocated 64MB of memory for the ‘PCI to AXI Lite Master Interface’, which means the mapped register address space can be up to 64MB in size.
The ‘PCI to AXI Translation’ field translates the PCI address into AXI territory. No matter where the host places the PCIe BAR within the host address space, any host access to that BAR will be translated, in our case, to a base address of 0 in AXI space. This seems logical if the BAR size is large enough, but if there are multiple AXI peripherals that require access, it could limit them and cause issues.
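The translation can be sketched as a one-line calculation. This is my reading of the behaviour described above, not Xilinx code: the offset of a host access within the BAR is added to the ‘PCI to AXI Translation’ value to form the AXI address.

```c
#include <stdint.h>

/* Sketch of BAR-to-AXI address translation (assumed behaviour, per the
 * description above): the access offset within the BAR is rebased onto
 * the configured 'PCI to AXI Translation' value. */
static uint64_t pci_to_axi(uint64_t host_addr, uint64_t bar_base,
                           uint64_t axi_translation)
{
    return axi_translation + (host_addr - bar_base);
}
```

With the default translation of 0, an access at BAR offset 0x10 lands at AXI address 0x10; programming a non-zero translation shifts the whole window up to that base.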
Now we’re left with the last 2 checkboxes: ‘Prefetchable’ and ‘64bit Enable’. The Prefetchable option enables faster operations between the CPU and the memory: a region of memory marked as prefetchable can be read ahead by the CPU as an optimization.
Taken from PCI Express Base Specification:
So, you guessed right, I checked both…
Putting it all together, in my project this tab looks like this:
Xilinx XDMA, while very easy to implement and very straightforward, does have a few drawbacks. Though they are not deal-breakers from my point of view, the average user should know them before starting to work with this core:
1. There are only 4 RX DMA channels and only 4 TX DMA channels. This means that if you need more than 4 for your design, you cannot use this solution.
2. The XDMA is a Xilinx wrapper for the PCIe bridge, as simple as that. What this means is that if you do want to implement further enhancements (like adding more channels), it cannot be done, as everything is under the hood and cannot be seen by the user. This may be sufficient for the average user, but when thinking ahead to a more sophisticated DMA implementation with this core, this is a show-stopper.
Other than that, Xilinx did a great job with this core. It is simple to use and easy to implement in your designs. The addition of the dma_bridge_resetn feature is a life-saver for the cases where you did something wrong and you want to reset everything except the link you’re using. This is an example of how Xilinx made an effort to ease the implementation phase of the XDMA for the average user.
MISC Tab
In this tab I did not alter anything. I decided not to use interrupts in my design, as polling is much preferred in terms of bandwidth. You can watch Jason’s tutorial on PCIe, at the relevant timestamp, to see how performance is much better with polling than with interrupts.
In this tab I’ve defined the Read and Write channels, 4 channels each (the maximum). Regarding the “Number of Request IDs for Read channel” and “Number of Request IDs for Write channel”, I did not change them (the default is as shown), even though I wanted to design a very simple example design with only one master; the AXI protocol supports a minimum of 2 IDs, so I could have changed it. Nonetheless, I did not.
Furthermore, Xilinx has a nice feature called Descriptor Bypass, which enables achieving high performance and bandwidth. I’ve checked it, as I wanted the highest performance from my setup. Descriptor Bypass means the descriptors are handled by the hardware, not by software or the driver. The implication is that the user must write his own logic for this mechanism, and I will warn you: it is not straightforward. Adding this feature will place input ports named dsc_bypass_h2c/c2h on the core.
DMA Status Ports – I’ve checked this as well, as it could help in my logic implementation.
So, before updating this tab, it looked like this:
And after changing all the checkboxes as described above, the core looks much more interesting, not to mention complicated:
In an effort to save you from going over the whole 150 pages, I thought about giving you some references to the most important and interesting parts in the manual.
Descriptor Bypass Ports
Looking at Table 33 and Table 34 in PG195, we can see the ports which are in charge of the Descriptor Bypass feature. These tables, together with Table 8 (which I'll cover in Part 3), will ease the design phase towards a working DMA PCIe system with the Descriptor Bypass feature.
Pay attention to the fact that if you're using Descriptor Bypass, like I was, then any reference to SGDMA is not relevant. Also, I want to emphasize that in my design I did not use MSI and IRQ for DMA, since I wanted better performance; I explained this earlier, and again refer you to Jason’s tutorial on PCIe to see how much better performance is with polling compared to interrupts.
DMA Channel Control
Moving on to Table 40 (for H2C) and Table 59 (for C2H) - I've placed part of the H2C table here:
The Run bit is obviously the one you will want to control in your design. Most of the others are used for logging.
DMA Test
A DMA test is a nice feature you may want to implement at the end of your design phase. Obviously, you will want to test your design mainly in terms of throughput.
For this purpose, Tables 52-56 (for H2C) and Tables 71-75 (for C2H) come in handy.
The idea here is very simple, and I'll explain it for H2C, as it is exactly the same for C2H. Set the Run bit mentioned in Table 52, then pass a predefined data file (a counter, for example) from the host towards the board (when testing the H2C direction). Measure the cycle count and the data count, mentioned in Tables 53 and 55 respectively, and divide them to get the full throughput. Just pay attention to the fact that the data count mentioned in Table 55 is in beats (that is, 4 bytes) and not in bytes, so keep that in mind in your calculations.
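The division above can be sketched as a small host-side helper. Two hedged assumptions here: the performance counters run at the 250 MHz AXI clock configured earlier, and the data count is in 4-byte beats, as stated in the text.

```c
#include <stdint.h>

/* Throughput calculation for the DMA test described above.
 * Assumptions (see lead-in): counters tick at the AXI clock,
 * and one beat of the data counter equals 4 bytes. */
static double throughput_mbytes_per_s(uint64_t data_beats,
                                      uint64_t clock_cycles,
                                      double clock_hz)
{
    double bytes   = (double)data_beats * 4.0;   /* beats -> bytes */
    double seconds = (double)clock_cycles / clock_hz;
    return bytes / seconds / 1.0e6;              /* MB/s */
}
```

For example, 1,000,000 beats counted over 500,000 cycles at 250 MHz works out to 4MB in 2ms, i.e. about 2000 MB/s.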
Now, after going over all the tabs, there’s one small thing I wanted to mention, which helped me a lot during the debug phase. This is the dma_bridge_resetn port, mentioned in Table 13:
This optional pin allows the user to reset all the internal registers of the XDMA core. During the debug phase there were many times I needed to reboot the board, since the core/board was stuck (as a result of a bad logic design), and then, while going over the manual, I stumbled upon this pin.
Chapter 4 in the PG195 manual explains how to enable it (by default it is disabled). You must read it carefully, as not following the rules mentioned there will cause the whole design to malfunction (data corruption, etc.). I've highlighted the most important point here, since it caused me the biggest headache: I asserted this pin before the PCIe transaction had actually finished. It took me a long time just to realize what I'd done, so be careful!
Even though the PIPE is just a checkbox, it deserves a thorough explanation. To accelerate verification and device development time for PCIe-based sub-systems, the PIPE (PHY Interface for PCI Express) architecture was defined by Intel. The core supports it to enable faster simulation. What it does under the hood is remove all the PCIe transceivers from the simulation, which can speed up simulation time considerably.
These 2 figures from the Xilinx tutorial on Mentor QVIP explain it:
When not using PIPE:
When using PIPE:
Since I received my board quite fast, I did not have to use this feature, but just to give an idea, the PCIe simulation takes about 20 minutes until the user receives a ‘Link up’ from the simulator and can start a data transmission. Think how annoying it would be to wait that long every time you change something in the code.
The Xilinx AXI BFM has been discontinued as of December 1, 2016 (read about it here) and is not supported after Vivado 2016.4, so if you’re using a later version you cannot use the AXI BFM.
Going back to the VIP, I’m sure there are many VIP solutions to work with, so I’ll name a few I’ve found:
XAPP1184 is a nice app note which has a link to download a free BFM for PCIe simulations from Avery-designs.com. It uses Cadence IES (which is not free). Xilinx, by the way, has a tutorial on how to work with Cadence IES and Vivado here.
Mentor Graphics Questa Verification IP (QVIP) is another PIPE solution, and there’s a Xilinx tutorial on how to use it. It uses the Questa verification tool.
Part 3 - DMA – Direct Memory Access!
In the following Part 3, I will go over each one of the blocks I've designed and implemented towards a fully working, functional DMA PCIe system. This is the last part of the 3-part tutorial.
I'll post here again the block diagram shown in Part 2:
Block 1 was explained in Part 2, so I'll skip directly to Block 2.
AXI to Native Block - Block 2
When starting to design such a project, one needs to define an external register interface to the outside world. This is obviously based on some form of communication (USB, I2C or… PCIe!). I chose PCIe, since the whole project is based on PCIe. What we need is a WR/RD command from the driver towards the user logic. A short scheme will best describe it:
The first thing we need to do is to define an AXI to Native block. Vivado can make this for us. Using the ‘Create and package new IP…’ option (via the Tools menu), the user can create the needed source files. Opening this menu gives us several options. We’ll need to define the new IP as a Slave; Lite AXI is enough. All other fields should be left unchanged, as shown below:
After Vivado creates the source files, we will change them, as the Master Write/Read commands are internal, and we want to make them external.
The created block defines a number of registers named slv_regN, where N is the index number of the register. In the screenshot above I’ve defined 4 registers, so the code will have slv_reg0 – slv_reg3. These registers are not used at all in my project; they are internal and only used for debug purposes (configured as RD/WR registers). Looking at the code snippet below, we can see that slv_reg_wren represents a write command from the driver towards the user logic.
As such, we want to bring this WR command out of this block and use it as a driver write towards our user logic. We’ll define it as simply as:
cpu_wr <= slv_reg_wren; --cpu_wr is an output port
At this point I want to stop and explain a few things before moving forward.
Actually, I could have defined slv_reg_wren as an output port and deleted the above code (remember I wrote that I do not use it?), but I wanted to make as few changes as possible to the Xilinx code. Since in VHDL we cannot read the state of an output port (unless we’re using VHDL-2008, which I did not), I preferred to leave slv_reg_wren as a signal and define a new output port called cpu_wr.
The same rule applies to all the other commands defined in the figure above. The Xilinx source code uses them as internal signals; we'll need to define them as external ports and use them in our user logic as a way to pass values over the PCIe interface to/from our user logic.
At the end of the Xilinx auto-generated code I’ve simply added these lines (after defining them as output ports):
All in all, we now have a block which enables us to interact with our user logic via PCIe driver!
Now we can move forward 😎.
Config AXI Modules - Block 3
In Block 3 (“External Config AXI Modules”) we can see 2 sub-blocks. These sub-blocks give the user a full Native-to-AXI solution.
Starting with the config_AXI_core sub-block: this block translates outside-world commands into a full AXI interface, via the config_AXI_to_reg block.
Hard to follow? It is not so complicated…
There are 2 situations in which an AXI core can be configured during the run-time of the design:
- From inside your RTL code (as part of a state machine, for example). This is what config_AXI_to_reg is used for.
- From a communication interface (UART, Ethernet, or PCIe in my case), when the user wants to read/write registers remotely. In this case the user must have some sort of interface block, and this is exactly what config_AXI_core is used for.
Regarding config_AXI_to_reg: since my code is not AXI compatible (it is pure ‘native’ code), I must have some sort of Native-to-AXI interpreter. I could have written all my blocks with AXI interfaces; instead, all my native source blocks are translated to AXI in order to configure the various AXI blocks in the design.
A rather nicer solution would have been to work with a MicroBlaze for these AXI configurations; I admit this is a more elegant solution. Still, I chose to work fully in VHDL all the way, and this includes the AXI configurations. I think it is eventually simpler to have it all in one RTL block, rather than split it between MicroBlaze and RTL blocks. The drawback is that all configurations must be carried out in RTL, which is a bit cumbersome.
Going back to the block diagram, we have various AXI interface blocks. Each one needs to have its own Native_to_AXI configuration interpretation block connected to it.
How do we define such a block?
Using the ‘Create and package new IP…’ option in Vivado (via the Tools menu), the same as we did with the CPU commands, except that now we’ll define our block as a Master (compared to a Slave in the CPU command block).
Once I created this block, I imported it into my BD as an RTL module. Let’s look inside the source code created. Looking at the ports, we first see this input port:
which is used by the user to initiate an AXI transaction. Obviously, you’ll need that.
Other important ports include all the AXI4-Lite ports. We’ll need at least some of them for our configuration.
I want to focus on these 3 interesting signals, the AXI4 internal state machine signals:
Putting aside the INIT_COMPARE command, which is not needed (it is only used for comparison in Xilinx’s generated example code), the INIT_WRITE and INIT_READ signals are used by the user to choose which action is needed (Read or Write).
Eventually, the user should define all the AXI4 interfaces as ports (at least what is needed in order to run the Xilinx AXI4 state machines used in the source code). In the figure below you can see what I chose to expose as an external AXI4 interface.
Our next block is a vital one: the DMA Channel block. This block holds all the sub-blocks which relate to the functional design of the DMA controller.
This block configures the registers and activity of the XDMA IP core. It has various sub-blocks, and I’ve numbered them A-E.
RXTX_DMA_controller – Sub Block A
The controller source code should have various state machines: an XDMA config state machine (native to AXI, as I referred to earlier) to interact with the XDMA registers, a descriptor_bypass state machine (recommended in case you want to achieve higher bandwidth), and other state machines according to your logic design, but these 2 are the most important ones.
Since I decided to control the descriptors on my own, I had to define an external port on my block and connect it to the dsc_bypass_h2c/c2h input on the XDMA.
Figures 6, 7 and 8 from PG195 illustrate how all this should be carried out.
The user needs to design the state machines according to these figures, straight and simple. Other than that, I did want to share a few hints regarding Figure 8 (PG195). This figure illustrates a few implementations of the descriptor bypass for the user. You can pass the descriptors in any one of the 3 options shown in this figure (all or just one).
The simplest way, as I see it, is option A. Both option B and option C force you to set all the signals on the same clock tick in which dsc_byp_load is sampled high (which is feasible only if all the command signals are variables, or else they will be set only on the next clock), so option A is what I went with, eventually.
Furthermore, you could (or should) also define status ports in your block and connect them to the DMA Status Ports on the XDMA. These are good for your state machines, as they are not registers but simple I/O ports, which makes your design much simpler. Needless to say, all these statuses can also be accessed through the XDMA registers, but that is far less convenient, as it involves going back and forth between your state machines, adding latency to your design (since you have to wait for the ‘AXI done’ signal before issuing another AXI write/read command).
Config AXI to reg – Sub Block B
This sub-block is exactly the same sub-block I placed in Block 3 ("Config AXI modules"). Remember the 2 situations I explained, regarding when the user needs to configure an AXI core?
Here I was referring to the first option:
There are 2 situations.... --> Inside your RTL code (part of a state machine, for example). This is what the config_AXI_to_reg is used for.
DDR AXI Master – Sub Block CSince I wanted to add support with the DDR already installed on board the KCU105, I’ve added this block. This block is connected to the DDR via the AXI_InterSmart_Connect block (block 5, see later on) and gives access to read/write from/to the DDR.
This block is taken as a whole from Silica Github source code (Designing-a-Custom-AXI-Master-using-BFMs). The idea to use it was given to me by Itamar Kahalani, Xilinx FAE. But why do I need it actually? I can use Xilinx ‘create new IP’ method to design a native to AXI block in this simple way:
The blue sub-block above is the same block which was covered extensively before. So, what is the problem with this method and why didn't I choose it?
The problem is if we want to write/read from the DDR in bursts, using the above method, it cannot be done. Using Xilinx ‘Create and package new IP’ indeed creates an AXI interface the user can modify, but there’s no way we can use an AXI burst mode to write to the DDR.
Using Silica free code is a great solution for this problem. Silica published a full featured source code, including pdf manual, simulation files and practically whatever the user needs in order to interface with any AXI block, including Burst mode. I needed it as a Master so I’ve used the sources taken from this link, but in any case, Silica has the same sources also for Slave mode (Designing-a-Custom-AXI-Slave-using-BFMs). Now my path looks a bit difference:
It may seem more complicated, but quite the contrary: you get so many options and new features with the Silica source code, and it totally levels up your design.
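For reference, an AXI4 INCR burst is described by AxLEN (number of beats minus one, up to 256 beats) and AxSIZE (log2 of the bytes per beat). A minimal sketch of the arithmetic a burst-capable master has to do (Python used here only as pseudocode for the signal math; the function name is mine):

```python
def axi_burst_params(total_bytes, data_bus_bytes):
    """Derive AXI4 INCR burst signals for one transfer.

    AWSIZE/ARSIZE encode log2(bytes per beat); AWLEN/ARLEN encode
    beats-1 (AXI4 allows up to 256 beats in a single INCR burst).
    """
    beats = (total_bytes + data_bus_bytes - 1) // data_bus_bytes
    assert beats <= 256, "split into multiple bursts beyond 256 beats"
    axsize = data_bus_bytes.bit_length() - 1  # log2(bytes per beat)
    axlen = beats - 1
    return axlen, axsize

# e.g. a 4 KB write on a 512-bit (64-byte) data bus:
axlen, axsize = axi_burst_params(4096, 64)
print(axlen, axsize)  # 63 (i.e. 64 beats) and size code 6
```

This is exactly the kind of multi-beat transaction the 'Create and Package New IP' template cannot issue, and the Silica master can.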
BRAM Logic block – Sub Block D
Since I chose to work with DMA descriptors in bypass mode, I needed on-chip memory to store them. The simplest solution is the BRAM. Bear in mind that if/when the number of descriptors is extremely high (meaning the driver mapped tons of non-contiguous pages to the DMA), you could end up overflowing the BRAM, so consider writing the scatter-gather table into the DDR instead. In my case this did not happen, so I worked with BRAM.
When working with BRAM, the user can choose between a native interface and AXI. Searching for 'BRAM' in the IP catalog search bar yields 3 options:
The upper two components are the ones we're interested in. The first is the AXI BRAM Controller, and as you guessed, it is used when you want to interface with the BRAM using the AXI protocol. Looking at the IP customization window:
You can see the memory depth is greyed out. This is a commonly raised question, explained nicely in AR 66103: only after you assign it an address range in the Address Editor will you be able to configure the memory depth, so keep that in mind. You can also see I've chosen AXI4-Lite rather than AXI4; I did not need the full AXI4 protocol, and AXI4-Lite suited my needs.
Moving on to the Block Memory Generator: in its configuration window there is a dropdown that lets you choose between two modes, "BRAM Controller" vs. "Stand Alone". The first pairs with the AXI BRAM Controller above, while the latter is Native (in which case no AXI BRAM Controller is needed).
When working in "Stand Alone" (Native) mode, pay attention to the small checkbox near the 'Stand Alone' mode selection, called "Generate address interface with 32 bits".
This checkbox instructs the BRAM to use byte addressing; when unchecked, it reverts to word addressing. Remember that when it is ticked, the logic driving the read/write port must increment the address by 4 on each access, which can be confusing.
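A quick illustration of the difference: the same sequential access pattern must drive different address sequences depending on that checkbox (a sketch of the address arithmetic, not tool output):

```python
def bram_addresses(num_words, byte_addressing):
    """Address sequence the read/write logic must drive.

    With the 32-bit (byte) address interface, consecutive 32-bit
    words sit 4 bytes apart; with word addressing they are 1 apart.
    """
    step = 4 if byte_addressing else 1
    return [i * step for i in range(num_words)]

print(bram_addresses(4, byte_addressing=True))   # [0, 4, 8, 12]
print(bram_addresses(4, byte_addressing=False))  # [0, 1, 2, 3]
```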
Naturally, when working in "Stand Alone" mode, the memory depth is not greyed out and the user can set the BRAM size needed, as shown in the picture:
AXI Subset Converter – Sub Block E
Using the Xilinx AXI Subset Converter is a must in this design. The DMA data keeps flowing without any way to stop it, so I needed some way to back-pressure it. In the picture below you can see the implementation. The tReady port is brought outside the DMA_channel block and will later be used as part of my user logic to back-pressure the DMA data arriving over the PCIe link.
Lastly, we must use an AXI interconnect to connect all the blocks together. You can use either AXI SmartConnect (SMC) or AXI Interconnect (IC); both will do the job. Just bear in mind that:
- SMC is better in terms of bandwidth, but costly in logic resources.
- IC, on the other hand, supports arbitration priority, which SMC does not.
- Resource utilization for both is documented here: SMC, IC.
- A nice trick I learnt from Itamar Kahalani, a Xilinx FAE, is to combine the two. This is a great way to gain the benefits of both IPs. In the figure below you can see how they are connected.
The user should connect the bandwidth-hungry consumers to the SMC (M01_AXI is connected to the DDR in my case), while keeping the less "important" consumers on the IC. They also gain access to the DDR, but with less performance than their neighbors on the SMC (note that the IC feeds one input of the SMC). Also, note that both resets going into the IC pass through a logic AND gate towards the SMC, so the reset is common to both the IC and the SMC.
The last two blocks gather all user logic and DMA control registers in one place (mon_fsm = Monitor FSM block, used to hold all state-machine debug registers in one place). This is done just to make life easier; it is not a must, and your design can certainly run with all the external register interfaces wandering around the design. It is the user's choice whether to place all external registers in one block or not. Still, from my experience, it is more convenient when you look for a register of a specific block and don't need to remember which block you placed it in.
Both blocks are designed the same way. The AXI-to-Native block (Block 2) converts the register interface signals from AXI to Native. These Native signals interact with the outside world over the PCIe interface and go into these blocks.
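The register-gathering idea boils down to one address decoder over a single map. A toy sketch of such a central register file (all names and offsets below are invented for illustration, not taken from the actual design):

```python
# Hypothetical register map gathering user-logic / DMA-control
# registers in one block, as the text suggests (offsets invented).
REG_MAP = {
    0x00: "dma_ctrl",
    0x04: "dma_status",
    0x08: "cycle_count",
    0x0C: "data_count",
}

class RegFile:
    """One central decoder: every register lives behind one map."""
    def __init__(self):
        self.regs = {name: 0 for name in REG_MAP.values()}

    def write(self, addr, value):
        self.regs[REG_MAP[addr]] = value & 0xFFFFFFFF

    def read(self, addr):
        return self.regs[REG_MAP[addr]]

rf = RegFile()
rf.write(0x00, 0x1)        # e.g. a 'start' bit in dma_ctrl
print(hex(rf.read(0x00)))  # 0x1
```

With everything behind one map like this, "which block did I put that register in?" stops being a question.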
Putting it all together
So, after everything is designed, connected, debugged and verified, the last step is to design a DMA test component. Such a component can be designed quite easily, and I've explained it in Part 2 of this tutorial.
Below you can see the results I measured using 4 channels in the H2C direction, after capturing the data and cycle counts and dividing one by the other (using the Cycle Count and Data Count registers as explained in Part 2). I calculated the total throughput to be 7.49 GB/s.
This is not entirely accurate, as the tests did not start at exactly the same time (I had a problem with the Linux server, so I could not run them in 'pure' parallel). Still, I ran it a few times and the average results were all above 7 GB/s. Well done, Xilinx!
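The arithmetic behind those numbers is simply each channel's Data Count divided by its Cycle Count times the clock period. A sketch, assuming a 250 MHz XDMA user clock and example register values (both are my assumptions, not the actual measurements above):

```python
USER_CLK_HZ = 250e6  # assumed XDMA user clock; check your configuration

def channel_throughput(data_bytes, cycles, clk_hz=USER_CLK_HZ):
    """Throughput in GB/s from the Data Count / Cycle Count registers."""
    seconds = cycles / clk_hz
    return data_bytes / seconds / 1e9

# e.g. one H2C channel moving 1 GiB in ~143 million clock cycles:
gbps = channel_throughput(2**30, 143_000_000)
print(round(gbps, 2))  # ~1.88 GB/s for this one channel
```

Four such channels running in parallel land in the ~7.5 GB/s range, consistent with the total reported above.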
So, a few last words before I finish this tutorial. As mentioned, I think the XDMA is really a life saver when it comes to implementing DMA over PCIe. The manual is loaded with registers and other important details, which should be pretty clear to the average and above-average user. There are a few drawbacks I've covered earlier (such as the maximum of 4 channels), but overall, nice job!