The AMD-Xilinx Video SDK [1] is a complete software stack that enables video streaming developers to seamlessly leverage the hardware-accelerated features of the Video Codec Units (VCUs) on the Alveo U30 card. The SDK enables high-density, real-time transcoding for video-on-demand and live video streaming services such as live broadcast, telemedicine, distance learning, eSports, live gaming, social video networking and live sports broadcast. The Video SDK includes a pre-compiled version of FFmpeg that integrates the key video transcoding plug-ins, enabling simple offloading of compute-intensive workloads from the CPU to the U30 FPGA. The Video SDK also provides a C-based application programming interface (API) for accessing the low-level FPGA VCU kernels when integrating them into a video streaming pipeline application. Alveo U30 video transcoding supports three use cases: Live Transcoding, Adaptive Bitrate (ABR) transcoding and Faster Than Real Time (FTRT) transcoding.
Overview of live streaming workflow
The diagram below shows an overview of a live video streaming workflow. It starts with a live production event on the left. The video file or live video streams must be prepared for distribution at scale. This is done by transcoding the video streams in steps 2, 3 and 4 below.
1. Demultiplex incoming video streams.
2. Decode the input video stream.
3. Generate multiple lower-resolution video streams using the scaler unit.
4. Encode the multiple lower-resolution streams.
5. Multiplex encoded video streams.
Finally, the streaming content is packaged into HTTP Live Streaming (HLS) [2] format for delivery to end users.
These streams are distributed to different clients based on the available network bandwidth. This provides a consistent viewing experience regardless of network conditions and of the end user's device processing power and screen resolution.
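To make the packaging step concrete, the sketch below writes a minimal HLS master playlist that lists one variant stream per rendition, and shows a deliberately simplified rule a client might use to pick a variant from its measured bandwidth. The rendition ladder, bit rates and playlist URIs are hypothetical, and real players use more elaborate adaptation logic; only the `#EXTM3U` and `#EXT-X-STREAM-INF` tags (with `BANDWIDTH` and `RESOLUTION`) come from the HLS specification.

```python
from typing import List, NamedTuple


class Rendition(NamedTuple):
    width: int
    height: int
    bandwidth: int  # peak bit rate in bits per second, advertised to the player
    uri: str        # media playlist for this rendition (hypothetical path)


def master_playlist(renditions: List[Rendition]) -> str:
    """Build a minimal HLS master playlist with one variant stream per rendition."""
    lines = ["#EXTM3U"]
    for r in renditions:
        lines.append(
            f"#EXT-X-STREAM-INF:BANDWIDTH={r.bandwidth},RESOLUTION={r.width}x{r.height}"
        )
        lines.append(r.uri)
    return "\n".join(lines) + "\n"


def pick_rendition(renditions: List[Rendition], measured_bps: int) -> Rendition:
    """Simplified client rule: highest bit rate that fits, else the lowest one."""
    fitting = [r for r in renditions if r.bandwidth <= measured_bps]
    if fitting:
        return max(fitting, key=lambda r: r.bandwidth)
    return min(renditions, key=lambda r: r.bandwidth)


# Hypothetical three-rung ladder, as produced by the scaling and encoding steps.
LADDER = [
    Rendition(1920, 1080, 6_000_000, "1080p/index.m3u8"),
    Rendition(1280, 720, 3_000_000, "720p/index.m3u8"),
    Rendition(640, 360, 800_000, "360p/index.m3u8"),
]

if __name__ == "__main__":
    print(master_playlist(LADDER), end="")
    print(pick_rendition(LADDER, 4_000_000).uri)  # 720p/index.m3u8
```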
Alveo U30 Video Transcoding Acceleration Card
The Alveo U30 accelerator card [3, 4] is a high-density media processing PCIe acceleration card. It is powered by two Zynq UltraScale+ MPSoC devices, each with an H.264/H.265 VCU. Each VCU supports a maximum aggregate throughput equivalent to 4K (3840x2160) at 60 frames per second (fps). Each U30 card supports simultaneous encoding and decoding of up to 48 lower-resolution streams, as shown in the table below. The U30 operates at a very low power of 25 W. AMD-Xilinx has tested video transcoding performance on different numbers of Alveo U30 accelerator cards, as shown in the table.
The table shows how many video streams of different resolutions and frame rates can be transcoded simultaneously by different numbers of cards.
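As a rough sanity check on the table, the per-VCU limit quoted above (4K at 60 fps) can be treated as a pixel-rate budget and divided by the pixel rate of each stream. This assumes throughput scales linearly with pixels per second and ignores per-device stream-count limits, codec settings and other overheads, so it gives an idealized upper bound rather than a reproduction of AMD-Xilinx's measured figures.

```python
# Per-device budget from the text: 4K (3840x2160) at 60 fps per VCU.
DEVICE_PIXEL_RATE = 3840 * 2160 * 60  # pixels per second
DEVICES_PER_CARD = 2                  # two Zynq UltraScale+ MPSoC devices per U30


def max_streams(width: int, height: int, fps: int, cards: int = 1) -> int:
    """Idealized stream count, assuming capacity scales with pixel rate."""
    per_device = DEVICE_PIXEL_RATE // (width * height * fps)
    return per_device * DEVICES_PER_CARD * cards


if __name__ == "__main__":
    for w, h, fps in [(3840, 2160, 60), (1920, 1080, 60), (1920, 1080, 30), (1280, 720, 30)]:
        print(f"{w}x{h}@{fps}: {max_streams(w, h, fps)} per card")
```

For example, one 1080p30 stream uses one eighth of the 4Kp60 pixel rate, so this estimate allows 8 such streams per device and 16 per card.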
Alveo U30 Transcoder use cases
Alveo U30 video transcoding is optimized for low-latency, real-time video streaming applications, and the U30 provides deterministic low-latency transcoding. It supports three types of transcoding [1].
1. Live transcoding:
The Alveo U30 card is primarily targeted at real-time live streaming video workloads. One or more video inputs, either from files or from live video streams, are fed into the transcoding pipeline, and the encoder produces one or more output streams. The video streams can be converted from one format to another, e.g., from H.264 to H.265 or vice versa, and/or to a reduced bit rate.
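A live transcode of this kind (H.264 in, H.265 out at a lower bit rate) can be launched through the SDK's FFmpeg. The sketch below only assembles and prints the command line unless `dry_run=False` is passed. The decoder and encoder names `mpsoc_vcu_h264` and `mpsoc_vcu_hevc` are assumed to be those used in the Video SDK FFmpeg examples; confirm them in the documentation for your SDK version [1]. The file names and the 5 Mb/s bit rate are hypothetical.

```python
import shlex
import subprocess
from typing import List

# Assumed U30 VCU codec names (check against the Video SDK documentation).
VCU_CODECS = {"h264": "mpsoc_vcu_h264", "hevc": "mpsoc_vcu_hevc"}


def live_transcode_cmd(src: str, dst: str, in_codec: str = "h264",
                       out_codec: str = "hevc", bitrate: str = "5M") -> List[str]:
    """Assemble an ffmpeg argv: VCU decode of src, VCU re-encode to dst."""
    return [
        "ffmpeg",
        "-c:v", VCU_CODECS[in_codec],   # placed before -i, so it selects the decoder
        "-i", src,
        "-c:v", VCU_CODECS[out_codec],  # placed after -i, so it selects the encoder
        "-b:v", bitrate,                # target video bit rate for the output
        "-y", dst,                      # overwrite dst if it already exists
    ]


def run(cmd: List[str], dry_run: bool = True) -> None:
    # shlex.quote keeps this compatible with the Python 3.6 shipped with Ubuntu 18.04
    print(" ".join(shlex.quote(arg) for arg in cmd))
    if not dry_run:
        subprocess.run(cmd, check=True)


if __name__ == "__main__":
    run(live_transcode_cmd("input_h264.mp4", "output_hevc.mp4"))
```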
2. Adaptive Bitrate (ABR) Transcoding:
In video streaming applications, video is distributed at different resolutions and bit rates to adapt to varying network bandwidth conditions. The ABR scaler downscales an input video stream to several lower-resolution streams before re-encoding them. ABR transcoding supports up to 32 input video streams per Zynq MPSoC device, with a maximum total bandwidth of 4Kp60. Each input stream can then be scaled to up to 8 lower-resolution or lower-frame-rate outputs. Spatial resolutions from 3840x2160 down to 128x128, in multiples of 4, are supported, as shown in the diagram.
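The ABR constraints listed above (at most 8 outputs per input, each between 128x128 and 3840x2160, with dimensions in multiples of 4) can be checked before a job is submitted. The helper below derives an output ladder from scale factors and rounds each dimension down to a multiple of 4; the rounding direction and the function itself are illustrative assumptions, not part of the SDK.

```python
from fractions import Fraction
from typing import List, Sequence, Tuple

MAX_OUTPUTS = 8                # outputs per input stream, per the text
MIN_W, MIN_H = 128, 128        # smallest supported resolution
MAX_W, MAX_H = 3840, 2160      # largest supported resolution


def abr_ladder(src_w: int, src_h: int,
               scales: Sequence[Fraction]) -> List[Tuple[int, int]]:
    """Scale the input by each factor and round width/height down to multiples of 4."""
    if len(scales) > MAX_OUTPUTS:
        raise ValueError(f"at most {MAX_OUTPUTS} outputs per input stream")
    ladder = []
    for s in scales:
        # Fraction keeps the arithmetic exact (1920 * 2/3 is exactly 1280)
        w = int(src_w * s) // 4 * 4
        h = int(src_h * s) // 4 * 4
        if not (MIN_W <= w <= MAX_W and MIN_H <= h <= MAX_H):
            raise ValueError(f"{w}x{h} is outside {MIN_W}x{MIN_H}..{MAX_W}x{MAX_H}")
        ladder.append((w, h))
    return ladder


if __name__ == "__main__":
    print(abr_ladder(1920, 1080, [Fraction(2, 3), Fraction(1, 2), Fraction(1, 3)]))
    # [(1280, 720), (960, 540), (640, 360)]
```

Scaling 1080p by 1/4 shows why the alignment matters: 1080/4 = 270 is not a multiple of 4, so this helper returns 480x268.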
3. Faster Than Real Time (FTRT) Transcoding:
In FTRT [1] transcoding, the input video file is split into multiple smaller segments, and each segment is transcoded individually. The more devices are available, the more segments can be processed in parallel. There is some overhead in splitting the clip into segments and stitching the segments back into a single output file, so the input video clip should be longer than 2 seconds to achieve an improvement in throughput. FTRT is an important use case for video streaming providers who need to get new video content onto their streaming platform as soon as possible, which gives them a competitive edge.
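The splitting step can be sketched as choosing cut points that land on keyframes (GOP starts), since a segment that does not begin on a keyframe cannot be decoded on its own. The function below is an illustrative assumption, not the SDK's implementation: it takes keyframe timestamps that are assumed to be known already (for example from probing the file), aims for equal-length segments, and snaps each ideal cut to the nearest keyframe after the previous cut.

```python
from typing import List, Sequence, Tuple


def split_on_gops(keyframes: Sequence[float], duration: float,
                  n_segments: int) -> List[Tuple[float, float]]:
    """Return up to n_segments (start, end) pairs whose cut points are keyframe times.

    keyframes: sorted keyframe timestamps in seconds (0.0 is the first one).
    """
    target = duration / n_segments
    cuts = [0.0]
    for i in range(1, n_segments):
        ideal = i * target
        # only keyframes after the previous cut and before the end are usable
        candidates = [k for k in keyframes if cuts[-1] < k < duration]
        if not candidates:
            break  # fewer GOPs than requested segments
        cuts.append(min(candidates, key=lambda k: abs(k - ideal)))
    cuts.append(duration)
    return list(zip(cuts[:-1], cuts[1:]))


if __name__ == "__main__":
    gop_starts = [float(t) for t in range(0, 600, 4)]  # hypothetical 4 s GOPs, 10 min clip
    for start, end in split_on_gops(gop_starts, 600.0, 4):
        print(f"{start:6.1f} -> {end:6.1f} s")
```

With 4-second GOPs the ideal cut at 150 s is not a keyframe, so the segments come out slightly uneven (0-148, 148-300, 300-448, 448-600 s); that unevenness is part of the split-and-stitch overhead.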
The table shows the performance of transcoding the Big Buck Bunny 1920x1080 video file (duration 10 minutes 34 seconds, 41,640 frames in total) with Faster Than Real Time transcoding on different numbers of U30 cards.
AMD-Xilinx provides a software solution stack [5] to simplify and speed up the development of video streaming application pipelines. Developers can use the FFmpeg command line interface or the C-based API to create custom video streaming pipelines. As shown in the diagram below, the U30 accelerator card sits at the bottom of the stack, and its binary image (XCLBIN) is provided. The Xilinx Runtime (XRT) API enables the host (x86) processor to communicate with and control the Alveo U30 accelerator card. The C-based API (XMA) allows developers to build their video streaming pipelines and control the low-level kernels. AMD-Xilinx has also developed an implementation of FFmpeg with plug-ins for H.264 and H.265 (HEVC) encode and decode, the ABR scaler and the lookahead function.
Video streaming developers have two options to develop their custom video streaming pipeline:
1. Using the FFmpeg command line interface, with several provided examples to start from.
2. Using the C-based XMA API. AMD-Xilinx provides instructions on how to build your own applications at a more primitive level.
To develop a video streaming pipeline, the Video SDK packages must be installed on an on-premises server.
1. Install Ubuntu 18.04 server on a PC with the default kernel and without a GUI:
- Install Ubuntu Server 18.04 LTS following reference [1]. Download the “ubuntu-18.04.6-live-server-amd64.iso” image from reference [6].
- Download Etcher [7] onto your PC.
- Using Etcher, flash the Ubuntu 18.04 server image onto a USB drive to make it bootable.
- Make sure the boot order in your PC BIOS menu is set to boot from USB.
- Ensure Secure Boot is disabled in your PC BIOS; otherwise Secure Boot prevents the U30 device kernel drivers from being detected.
- Insert the bootable USB drive into your PC. The Ubuntu 18.04 server installation should start from the bootable “ubuntu-18.04.6-live-server-amd64.iso” image on the USB drive. To complete the installation, follow the steps in references [8, 9].
2. Install the Linux 5.4 kernel for Ubuntu 18.04 server using the following command:
sudo apt-get install --install-recommends linux-generic-hwe-18.04
3. Install the GNOME desktop and the GNOME Tweaks application on Ubuntu 18.04 server using the following commands:
sudo apt install ubuntu-gnome-desktop
sudo apt install gnome-tweaks
4. Install the Alveo U30 cards into PCIe slots of the desktop PC.
5. When Secure Boot is enabled, the U30 kernel drivers are not detected, as shown below.
With Secure Boot turned off, the kernel drivers are detected, as shown below:
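The kernel and Secure Boot requirements can be double-checked from a terminal before the SDK is installed. The sketch below, an illustration rather than an SDK tool, checks that the running kernel is at least 5.4 and reads the Secure Boot state with `mokutil --sb-state`, which prints `SecureBoot enabled` or `SecureBoot disabled` on UEFI systems. It assumes the `mokutil` package is installed and sticks to Python 3.6 features, since that is the default `python3` on Ubuntu 18.04.

```python
import platform
import re
import shutil
import subprocess
from typing import Optional


def kernel_at_least(release: str, major: int = 5, minor: int = 4) -> bool:
    """Compare the major.minor prefix of a `uname -r` style release string."""
    m = re.match(r"(\d+)\.(\d+)", release)
    if m is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(m.group(1)), int(m.group(2))) >= (major, minor)


def secure_boot_enabled(sb_state: str) -> Optional[bool]:
    """Parse `mokutil --sb-state` output; None when the state is not reported."""
    text = sb_state.lower()
    if "secureboot enabled" in text:
        return True
    if "secureboot disabled" in text:
        return False
    return None


def preflight() -> None:
    release = platform.release()
    print(f"kernel {release}:", "ok" if kernel_at_least(release) else "older than 5.4")
    if shutil.which("mokutil") is None:
        print("mokutil not found; check Secure Boot in the BIOS setup instead")
        return
    # stdout=PIPE / universal_newlines instead of capture_output/text (Python 3.7+)
    out = subprocess.run(["mokutil", "--sb-state"], stdout=subprocess.PIPE,
                         stderr=subprocess.STDOUT, universal_newlines=True).stdout
    state = secure_boot_enabled(out)
    if state is None:
        print("Secure Boot state unknown:", out.strip())
    else:
        print("Secure Boot is", "ENABLED (disable it for the U30 drivers)" if state else "disabled")


if __name__ == "__main__":
    preflight()
```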
6. Install the AMD-Xilinx Video SDK on Ubuntu 18.04 server by following steps 1 to 7 in reference [10]. After the installation completes successfully, the video-SDK folder contains the folders and files shown below, including the Video SDK documentation, Docker container, license, C-based source code and example scripts.
The developer can start with either the FFmpeg example scripts or the C-based API (XMA).
The Video SDK FFmpeg example scripts cover basic filters for video rotation, logo overlay, crop and zoom, and compositing several videos onto one screen, as in a Zoom conference. The quality_analysis scripts enable developers to evaluate the quality of a transcoded video file against the raw video file.
Several tutorial scripts are also provided as part of the Video SDK as a starting point for developing a custom video streaming pipeline.
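Objective quality comparisons of this kind usually rely on metrics such as PSNR. The sketch below computes PSNR over the luma (Y) plane of one frame taken from two raw files. It assumes 8-bit planar YUV 4:2:0 input of a known resolution, uses hypothetical file names, and is not claimed to be the metric or the code used by the SDK's quality_analysis scripts.

```python
import math
from typing import Sequence


def psnr(reference: Sequence[int], test: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length sample sequences."""
    if len(reference) != len(test) or len(reference) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical inputs
    return 10 * math.log10(max_value ** 2 / mse)


def read_luma(path: str, width: int, height: int, frame: int = 0) -> bytes:
    """Read the Y plane of one frame from a raw 8-bit YUV 4:2:0 planar file."""
    frame_size = width * height * 3 // 2  # Y plane plus two quarter-size chroma planes
    with open(path, "rb") as f:
        f.seek(frame * frame_size)
        data = f.read(width * height)
    if len(data) != width * height:
        raise ValueError(f"frame {frame} is missing or truncated in {path}")
    return data


if __name__ == "__main__":
    # hypothetical file names: the original raw source vs. the decoded transcode
    ref = read_luma("/dev/shm/source.yuv", 1920, 1080)
    out = read_luma("/dev/shm/transcoded.yuv", 1920, 1080)
    print(f"Y-PSNR: {psnr(ref, out):.2f} dB")
```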
7. Source the setup script each time you open a new terminal on your Ubuntu server; this is required for the environment to be configured correctly. The setup script exports important environment variables, starts the AMD-Xilinx Resource Manager (XRM) daemon, and ensures that the AMD-Xilinx devices and the XRM plug-ins are properly loaded. It also puts the FFmpeg binary provided with the AMD-Xilinx Video SDK at the front of the system PATH.
On this system, the setup script found two U30 cards and reported that the image had been loaded onto the devices.
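Because the setup script has to be sourced in every new terminal, a quick check that it has taken effect can save confusion later. The sketch below verifies that `ffmpeg` on the PATH comes from the SDK build and that `XILINX_XRT` (the variable exported by the XRT setup script) is defined. The install prefix `/opt/xilinx/ffmpeg` and the assumption that the Video SDK setup also sets `XILINX_XRT` are unverified assumptions to adjust for your installation.

```python
import os
import shutil
from typing import List

# Assumed install prefix of the SDK's FFmpeg build; adjust for your system.
SDK_FFMPEG_PREFIX = "/opt/xilinx/ffmpeg"


def check_environment(expected_prefix: str = SDK_FFMPEG_PREFIX) -> List[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    ffmpeg = shutil.which("ffmpeg")  # first ffmpeg found on PATH
    if ffmpeg is None:
        problems.append("ffmpeg is not on PATH")
    elif not os.path.abspath(ffmpeg).startswith(expected_prefix):
        problems.append(f"ffmpeg resolves to {ffmpeg}, not the SDK build under {expected_prefix}")
    if "XILINX_XRT" not in os.environ:
        problems.append("XILINX_XRT is not set; source the setup script first")
    return problems


if __name__ == "__main__":
    issues = check_environment()
    print("\n".join(issues) if issues else "environment looks configured")
```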
Faster Than Real Time Transcoding on Alveo U30
Some of the provided tutorial examples read or write RAW files from disk (encode-only or decode-only pipelines). Because of the large bandwidth needed to operate on these RAW files, you may notice a drop in FPS; this is caused by disk speed, not by the Xilinx Video SDK. It is recommended to read and write such files in "/dev/shm", a shared-memory RAM disk. In this FTRT example, the Big Buck Bunny video file “bbb_1080p_30fps_normal.mp4” [11] is copied to the RAM disk.
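Staging the input on the RAM disk is a plain file copy, but /dev/shm is backed by memory and is typically limited to a fraction of RAM, so it is worth checking free space first. The helper below is a sketch with a hypothetical 10% headroom factor. Files in /dev/shm are lost on reboot, and transcoded outputs written there consume memory too.

```python
import os
import shutil


def stage_to_ramdisk(src: str, ramdisk: str = "/dev/shm", headroom: float = 1.1) -> str:
    """Copy src into the RAM disk after checking there is enough free space."""
    size = os.path.getsize(src)
    free = shutil.disk_usage(ramdisk).free
    needed = int(size * headroom)  # hypothetical safety margin
    if needed > free:
        raise OSError(f"{ramdisk} has {free} bytes free, about {needed} needed")
    dst = os.path.join(ramdisk, os.path.basename(src))
    shutil.copy2(src, dst)  # copies the data and the file timestamps
    return dst


if __name__ == "__main__":
    print(stage_to_ramdisk("bbb_1080p_30fps_normal.mp4"))
```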
When processing file-based video clips, it is possible to run faster than real time (FTRT). The 13_ffmpeg_transcode_only_split_stitch.py script starts by automatically detecting the number of devices available in the system and then determines how many jobs can run on each device based on the resolution of the input file. The input file is then split into segments aligned on Group of Pictures (GOP) boundaries, so that no important compression information is lost and the encoder can re-synchronize at the start of each segment. Parallel FFmpeg jobs are submitted to the transcoder, and all the segments are processed simultaneously. Once all the segments have been encoded, they are concatenated to generate the final output video file. The FTRT Python script arguments are -y to overwrite the output, -s for the input video file and -d for the output video file.
python3 13_ffmpeg_transcode_only_split_stitch.py -y -s <INPUT_FILE> -d <OUTPUT_FILE>
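The planning step the script performs can be illustrated with a simple rule of thumb: count the devices (two per U30 card) and, on each device, fit as many input-resolution jobs as there are frames of that size in one 4K frame. This resolution-only rule is an assumption, not the script's actual logic, but for two cards and a 1920x1080 input it gives 4 devices x 4 jobs = 16 segments, which matches the result reported for this run.

```python
from typing import Tuple

UHD_PIXELS = 3840 * 2160  # one 4K frame: the per-device resolution budget assumed here


def plan_jobs(num_cards: int, width: int, height: int,
              devices_per_card: int = 2) -> Tuple[int, int, int]:
    """Hypothetical planner returning (devices, jobs_per_device, total_segments)."""
    devices = num_cards * devices_per_card
    jobs_per_device = max(1, UHD_PIXELS // (width * height))
    return devices, jobs_per_device, devices * jobs_per_device


if __name__ == "__main__":
    print(plan_jobs(2, 1920, 1080))  # (4, 4, 16)
```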
The transcode-only split-and-stitch script identified two Alveo U30 cards, each with two Zynq MPSoC devices. It also identified that the input video file (Big Buck Bunny) is approximately 10 minutes long, with 1920x1080 resolution at 30 fps. Based on the available resources, the input video file was split into 16 segments, each transcoded separately. The 16 segments were transcoded in parallel in only 38 seconds, and all the segments were then stitched together to generate the final transcoded output file. The video clip was transcoded 16.7 times faster than real time, processing 500.53 fps. This corresponds to an efficiency of approximately 104%. The extra 4% is understood to be due to manufacturing differences between the silicon chips.
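The headline figures follow from the clip length (10 minutes 34 seconds at 30 fps) and the 38-second wall-clock time. Efficiency is taken here to be the speed-up divided by the number of parallel jobs, which is our reading of how the 104% figure was obtained rather than a definition from the SDK.

```python
clip_seconds = 10 * 60 + 34   # 634 s of video
source_fps = 30
wall_clock_seconds = 38       # measured FTRT transcode time
parallel_jobs = 16            # segments transcoded at the same time

speedup = clip_seconds / wall_clock_seconds   # how many times faster than real time
throughput_fps = speedup * source_fps         # frames processed per second
efficiency = speedup / parallel_jobs          # > 1 means the average job beat real time

print(f"speed-up   {speedup:.1f}x")             # 16.7x
print(f"throughput {throughput_fps:.2f} fps")   # 500.53 fps
print(f"efficiency {efficiency:.0%}")           # 104%
```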
ffplay, FFmpeg's media player, was used to play the FTRT-generated output video file “output_FTRT.mp4”, as shown in the attached clip.
References:
[1]: https://xilinx.github.io/video-sdk/v1.5/index.html
[2]: https://developer.apple.com/streaming/
[3]: https://www.xilinx.com/products/boards-and-kits/alveo/u30.html
[4]: https://www.xilinx.com/content/dam/xilinx/support/documents/data_sheets/ds970-u30.pdf
[5]: https://www.xilinx.com/content/dam/xilinx/publications/solution-briefs/u30-sdk-solution-brief.pdf
[6]: https://releases.ubuntu.com/18.04/
[7]: https://www.balena.io/etcher/
[8]: https://ubuntu.com/tutorials/install-ubuntu-server#1-overview
[9]: https://www.fosslinux.com/6406/how-to-install-ubuntu-server-18-04-lts.htm
[10]: https://xilinx.github.io/video-sdk/v1.5/getting_started_on_prem.html
[11]: https://test-videos.co.uk/bigbuckbunny/mp4-h264
Revision History
27/04/2022 - Initial version
Acknowledgements
Thanks to Sergei Storogev from AMD-Xilinx for his valuable in-depth knowledge and help.