Virtual Power: A Deep Dive Into Xilinx's Hypervisor on the Zynq UltraScale+ MPSoC
Explore the granular details of why the Xilinx Zynq UltraScale+ FPGA on the ZCU102 development board is ideal for AI modeling on the edge.
You may be hearing the term “the edge” or “edge computing” quite a bit lately, but what is it and why is it so important right now? Despite everyone having their own personal computers, we have transitioned to the age of cloud computing and streaming for our data processing and storage needs. We work from services on centralized servers such as Google Docs and Apple’s iCloud. The devices in our homes, such as Nest cameras, Amazon Alexa, and Apple TVs, also depend on cloud services. This presents the challenge of transmitting large amounts of data back and forth between cloud servers and a given client with low enough latency that the client’s application can perform adequately. The cloud servers themselves also have to physically exist somewhere in the world, meaning that the companies hosting those cloud services, like IBM, Google, Amazon, and Apple, are paying to maintain large server facilities and to expand them fast enough to meet user demand.
For some applications, this exchange between a client device and the cloud server becomes a bottleneck that cripples the application's performance. One of the prime examples of this is artificial intelligence (AI), with its demanding workload of processor-intensive tasks such as machine learning and environmental sensing, coupled with the need to react to input data almost instantaneously. Imagine if the forward collision sensor of an autonomous car had to send its data to a server outside the vehicle for the path planning AI algorithms to be run, and the car then had to wait for the answer to be returned before it could react to an imminent danger or obstacle. The necessary solution is to bring some or most of that processing power from the cloud directly to the device itself, which alleviates both the latency problems and the overloading of the cloud server and network as user demand increases.
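To put a rough number on the autonomous car example, here is a minimal sketch of how far a vehicle travels while it waits for an answer. The speed and both latency figures are hypothetical values I picked purely for illustration; they are not measurements from any real vehicle, network, or cloud service.

```python
def distance_travelled_m(speed_kmh: float, latency_ms: float) -> float:
    """Distance covered (in metres) while waiting latency_ms for a response."""
    speed_m_per_s = speed_kmh * 1000.0 / 3600.0
    return speed_m_per_s * (latency_ms / 1000.0)


# Hypothetical figures, chosen only for illustration.
SPEED_KMH = 108.0            # highway speed, i.e. 30 m/s
CLOUD_ROUND_TRIP_MS = 100.0  # assumed sensor -> cloud -> car round trip
ON_DEVICE_MS = 5.0           # assumed on-device inference time

cloud_m = distance_travelled_m(SPEED_KMH, CLOUD_ROUND_TRIP_MS)
edge_m = distance_travelled_m(SPEED_KMH, ON_DEVICE_MS)
print(f"cloud: {cloud_m:.2f} m travelled, on-device: {edge_m:.2f} m travelled")
```

With these assumed numbers the car covers about 3 m before a cloud answer arrives, versus about 0.15 m when the decision is made on the device.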
This is the core principle of edge computing: to offload data processing from the cloud server and perform it at a node somewhere near the "edge" of the client’s internal network topology. This means the data is processed where it is collected, which addresses the latency-sensitive tasks while eliminating the bottleneck of network congestion.
Being able to run artificial intelligence applications at the edge of a network requires both a hardware and software solution capable of low latency processing, adaptive network protocol implementation, multiprocessing, and an element of reprogrammability to keep pace with the rapidly evolving AI models. This lays out the framework to achieve the decentralized computing structure necessary for AI computing at the edge of a network. This list of requirements for a client device almost exactly mirrors the feature list of an FPGA system on a chip (SoC) device. The combination of the flexibility offered by the programmable logic of the FPGA and the hardware peripherals offered by a hard processor core within the same chip makes hardware selection a simpler matter, as it is an all-in-one solution.
The Xilinx Zynq UltraScale+ MPSoC chip is a formidable powerhouse with hardware such as its quad-core Arm Cortex-A53 processor with single and double precision floating point units (FPUs), dual-core Arm Cortex-R5 real-time processor, Arm Mali-400MP GPU, and DDR4/3/3L and SMC memory controllers, just to name a few. While each of the cores in both the quad-core Arm Cortex-A53 and the dual-core Arm Cortex-R5 is capable of running an independent software stack (that is, an operating system, bare-metal application, or RTOS), there are some restrictions as to which hardware peripherals can be shared between them when running. Some resources, such as the L1 caches and memory management unit (MMU) resources, have to be divided up and can't be shared at all.
To overcome these obstacles and allow an AI application to take the fullest advantage possible of the hardware peripherals offered by the Zynq UltraScale+ MPSoC, independent software stacks can be virtualized and run as virtual machines (VMs) on the chip using a hypervisor as a resource manager. While running, the hypervisor allows each VM access to system resources such as processor cores, memory, I/O interfaces, and custom logic in the fabric of the FPGA. Think of it as having a single laptop running multiple VMs simultaneously, each running a different OS such as Linux, Windows, or Mac OS X. You can easily switch between them while working since they're all running as VMs, whereas with a partitioned hard drive you would have to reboot into one of the OSs and could only run one at a time.
Virtualization is the process by which a software stack is isolated in such a way that it appears to be the only software running on the system. This isolation prevents the software stacks from unintentionally interacting with each other. This is ideal for AI models with machine learning algorithms, as an entire model could be thrown off if the input/output from another AI model were to accidentally taint it. It also makes the development of the models more straightforward and cleaner since they can each live on their own OS or standalone application. Another advantage of using VMs in the edge computing topology is that frontend tasks like taking in user input and backend tasks such as interfacing with the cloud server can each be isolated to their own software stack, since the hypervisor can facilitate communication between the running VMs.
Virtualization has a set of minimum hardware and software requirements in order to be able to run a software stack as a virtual machine.
Hardware needed:
- A system timer for the hypervisor not available to any of the software stacks.
- A Peripheral Protection Unit (PPU) to act as a gatekeeper of the access to the various peripherals available for the software stacks to use.
- An interrupt controller that can handle both hardware and software interrupts.
- A Memory Management Unit (MMU) for virtual to physical address mapping.
Software needed:
- A hypervisor to abstract the hardware layer from the software stacks so it appears to them as if they are the only software running on the platform.
- Para-virtualized drivers, which connect the driver inside the virtualized software stack to the actual peripheral it is calling.
There are two types of hypervisors. A Type 1 hypervisor is booted directly on the hardware and acts as a layer between the OS and the hardware, while a Type 2 hypervisor runs on top of an OS and uses the hardware through that OS. VM applications such as VirtualBox and Parallels for Mac are Type 2 hypervisors. The hypervisor that is currently built into the Xilinx platform is Xen, a Type 1 hypervisor, and it is the focus of this article.
A hypervisor is capable of mapping various applications and tasks from the software stacks to different cores of the CPUs in one of two ways. The hypervisor can either map one OS to one particular CPU core, or it can multiplex each OS, or even each OS application/task, between the various CPU cores. This demonstrates one of the main advantages of virtualization: multiple OSs running on a single chip can essentially have access to the entirety of the chip's hardware whenever necessary. It also means that virtualization can be used to create redundancy in a system. If one OS crashes, it does not affect the others, so one of the other OSs can be configured to take over the crashed OS's tasks. The same isolation helps from a security standpoint, in that vulnerable points in the system can be kept apart from other elements by placing them in separate OSs (VMs).
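The two mapping strategies can be sketched as follows. This is a toy model I wrote to illustrate the idea, not how Xen's schedulers actually work; the function names, guest names, and the simple round-robin policy are all assumptions of mine.

```python
from itertools import cycle


def pin_one_to_one(guests, cores):
    """Static mapping: every guest OS owns exactly one CPU core."""
    if len(guests) > len(cores):
        raise ValueError("not enough cores to pin every guest")
    return dict(zip(guests, cores))


def multiplex(guests, cores, time_slices):
    """Round-robin multiplexing (illustrative policy only).

    In each time slice the cores run the next guests from a rotating
    queue. Returns one {core: guest} dict per time slice.
    """
    queue = cycle(guests)
    # Never run the same single-vCPU guest on two cores in one slice.
    runnable = min(len(guests), len(cores))
    schedule = []
    for _ in range(time_slices):
        active_cores = cores[:runnable]
        schedule.append({core: next(queue) for core in active_cores})
    return schedule


# Hypothetical guests on two of the A53 cores.
guests = ["linux", "rtos", "baremetal"]
print(pin_one_to_one(guests[:2], [0, 1]))
for slice_index, running in enumerate(multiplex(guests, [0, 1], 3)):
    print(slice_index, running)
```

Pinning gives each guest predictable timing at the cost of idle cores, while multiplexing lets three guests share two cores. Which trade-off is right depends on how latency sensitive each software stack is.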
Xilinx's Zynq UltraScale+ chips were designed with the appropriate hardware to facilitate virtualization, including their optimized MMU, SMMU, MPU, and PPU. The Arm Cortex-A53 processor on the Zynq UltraScale+ implements all four of the exception levels necessary for isolating and managing the individual software stacks. It also supports both the AArch32 and AArch64 execution states at every exception level.
The MMU in the Zynq UltraScale+ helps the hypervisor virtualize a software stack by performing a second stage of address translation: the virtual address is first translated to an intermediate physical address, which is then translated to the actual physical address. This allows a software stack to believe that it has control of the actual physical address, which vastly simplifies the workload of the hypervisor. While each core of the Arm Cortex-A53 has its own MMU, the Zynq UltraScale+ also contains an SMMU (System MMU) that sits one level above and handles address translation for the other peripherals, as well as any logic located in the programmable fabric of the FPGA. The standard MIO peripherals (SPI, UART, I2C, SD, etc.) external to the Arm cores are protected by the Xilinx PPU (XPPU), while the high speed peripherals (DDR, high-speed AXI buses, OCM, etc.) are protected by the Xilinx MPU (XMPU).
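The two-stage translation can be sketched with a pair of lookup tables. This is a deliberately simplified model: real Armv8-A page tables are multi-level structures walked by hardware, and the page size, table contents, and function names below are assumptions made for illustration only.

```python
PAGE_SIZE = 4096  # assumed 4 KiB pages


class TranslationFault(Exception):
    """Raised when an address has no mapping in a table."""


def translate(table, address):
    """Look up one stage: map the page number, keep the page offset."""
    page, offset = divmod(address, PAGE_SIZE)
    if page not in table:
        raise TranslationFault(hex(address))
    return table[page] * PAGE_SIZE + offset


def va_to_pa(stage1, stage2, virtual_address):
    """Stage 1 (owned by the guest OS): VA -> IPA.
    Stage 2 (owned by the hypervisor): IPA -> PA."""
    intermediate = translate(stage1, virtual_address)
    return translate(stage2, intermediate)


# Hypothetical mappings, keyed and valued by page number.
guest_stage1 = {0x10: 0x80}         # guest believes page 0x80 is "physical"
hypervisor_stage2 = {0x80: 0x1234}  # hypervisor decides where it really lives

va = 0x10 * PAGE_SIZE + 0x2A
print(hex(va_to_pa(guest_stage1, hypervisor_stage2, va)))
```

The guest only ever edits its own stage 1 table, so it can manage memory as if it owned the machine, while the hypervisor keeps the final say through stage 2.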
The XMPU also handles dividing memory regions into unprotected versus protected areas for the Arm TrustZone architecture to use. Arm TrustZone works by adding an extra bit to the AXI interface used within the Zynq UltraScale+ that marks each transaction as secure or non-secure. This means that software within the Zynq UltraScale+ part can control the security of its communication with its peripherals and have it enforced in hardware.
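As a rough mental model of this gatekeeping, the check below combines a per-region secure flag with a list of permitted bus masters. It does not reflect the real XMPU or XPPU register layout; the `Region` and `Transaction` types, the master IDs, the address ranges, and the default-deny behavior are all hypothetical choices of mine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    start: int                   # first address in the region
    end: int                     # last address in the region (inclusive)
    secure_only: bool            # True: reject non-secure transactions
    allowed_masters: frozenset   # bus master IDs permitted to access it


@dataclass(frozen=True)
class Transaction:
    master_id: int
    address: int
    non_secure: bool             # the extra security bit on the bus


def is_allowed(regions, txn):
    """Return True if the transaction may proceed."""
    for region in regions:
        if region.start <= txn.address <= region.end:
            if region.secure_only and txn.non_secure:
                return False
            return txn.master_id in region.allowed_masters
    return False  # assumption: addresses outside every region are denied


# Hypothetical layout: one secure region and one shared region.
regions = [
    Region(0x0000_0000, 0x0FFF_FFFF, True, frozenset({1})),
    Region(0x1000_0000, 0x1FFF_FFFF, False, frozenset({1, 2})),
]
print(is_allowed(regions, Transaction(2, 0x1000_0040, True)))   # shared region
print(is_allowed(regions, Transaction(1, 0x0000_0040, True)))   # secure region
```

The point of doing this in hardware is that a misbehaving VM cannot bypass the check, since every transaction on the bus is filtered before it reaches the memory or peripheral.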
Getting started with the Zynq UltraScale+ chip is made straightforward by the availability of development boards for it such as the ZCU102 (check out an unboxing for it here). The ZCU102 in particular is a development board sold as a full kit that provides a user with a full platform for a Zynq UltraScale+ chip to start, prove-in, and iterate on their design. The ZCU102 offers just about every peripheral a potential application could need:
- RJ45 Ethernet connector
- PMOD I/O
- FMC breakout headers
- 4GB of DDR4
- PMBUS and system controller MSP430 for power, clocks, and I2C bus switching
- SATA (1 x GTR), PCIe Gen2x4 root port
- HDMI input and output
- VESA DisplayPort 1.2 source-only controller supports up to two lanes of main link data at rates of 1.62 Gb/s, 2.70 Gb/s, or 5.40 Gb/s
- UART-to-USB bridge
- JTAG over USB so an external JTAG programmer is not needed
- The ability to boot from dual quad-SPI flash memory or an SD card
- CAN header
- 4x SFP+ cage
- USB2/3 (MIO ULPI and 1 GTR) connector
- System clocks, user clocks, jitter attenuated clocks
- 12V wall adapter or ATX power supply
While the majority of final designs won't need such a huge arsenal of peripherals on hand, it's invaluable to have them available when you are first starting a design and still fleshing out requirements. I've personally used the ZCU102 on a few projects in the past, and the ability to switch out and test different hardware interfaces for my designs on the same board definitely saved me a lot of time. Having worked on software defined radio development, I find the FMC breakout connectors make the ZCU102 an easy choice, since many transceiver development boards use FMC connectors for their high speed data transfer properties.
One of the most challenging aspects of laying out a custom FPGA board for a project is predicting and calculating the power supply requirements for the design, particularly for AI models, as they can be somewhat of a moving target. This is another huge advantage of starting a design on a development board like the ZCU102: the power supply is robust, and while it'll most likely be overkill for the design's final needs, it's again a huge time saver to not have to worry about it up front. Tools such as the power analysis in the implemented design tab of Vivado let you calculate the design's power draw as you iterate on it.
The most valuable resource that comes with using the ZCU102 is the set of board support packages (BSPs) for it in both Vivado and PetaLinux, which provide all the needed hooks for the hardware on the board so a user can immediately jump into working on their own design instead of spending too much time just setting up the base project.
The PetaLinux BSP for the ZCU102 includes what is necessary for jumpstarting a virtualization design, with the Xen drivers in the root file system, the device tree, and the kernel settings already included. This is why the ZCU102 has been one of the boards I reach for to hit the ground running on a new project. It demonstrates all of the hardware acceleration, low latency architecture, and performance specifications the Xilinx Zynq UltraScale+ FPGAs are capable of that make them such an appealing choice to designers.
Edge computing is a huge topic in the FPGA world at the moment, for AI computing and pretty much every other application that connects to a cloud-like network. If you work as an FPGA developer, it's more than likely that you'll find yourself working on an edge computing based project at some point. Hopefully this deep dive has provided some insight into how to take full advantage of FPGA hardware via virtualization.
All thoughts/opinions are my own and do not reflect those of any company/entity I currently/previously associate with.