TensorFlow Lite's New FP16 Half-Precision Support Sees Inference Performance Double on Arm Devices
If you're targeting a device with a modern Arm chip, on-device machine learning is about to get a serious speed-up.
The TensorFlow team has announced a new release of TensorFlow Lite that nearly doubles CPU-based, on-device inference performance on devices with Arm processors, by enabling a half-precision mode in the XNNPack back-end.
"For a long time FP16 [half-precision] inference on CPUs primarily remained a research topic, as the lack of hardware support for FP16 computations limited production use-cases," explain TensorFlow engineers Marat Dukhan and Frank Barchard in a joint post on the topic. "However, around 2017 new mobile chipsets started to include support for native FP16 computations, and by now most mobile phones, both on the high-end and the low-end. Building upon this broad availability, we are pleased to announce the general availability for half-precision inference in TensorFlow Lite and XNNPack."
The switch from the default IEEE 754 single-precision FP32 mode to FP16 brings with it the promise of major speed gains on compatible devices: TensorFlow's internal testing shows, on average, just short of double the performance across a range of common models and mobile devices, including MobileNet v2 and MobileNet v3-Small image classification, DeepLab v3 segmentation, BlazeFace face detection, and the SSDLite and Objectron object detection models.
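Readers who want to sanity-check that claim on their own hardware can do so with a few lines of Python. The rough sketch below (not the TensorFlow team's benchmark harness) times repeated Interpreter invocations over an FP32 model and an FP16-weight variant; the model file names are hypothetical placeholders (producing an FP16-weight file is covered further down), and the half-precision path only pays off on a CPU with native FP16 arithmetic.

    # Rough timing sketch using the TFLite Python Interpreter.
    # Model paths are hypothetical placeholders; absolute numbers will
    # depend heavily on the device, thread count, and TensorFlow build.
    import time
    import numpy as np
    import tensorflow as tf

    def time_model(path, runs=100):
        interpreter = tf.lite.Interpreter(model_path=path, num_threads=4)
        interpreter.allocate_tensors()
        inp = interpreter.get_input_details()[0]
        # Feed random data shaped like the model's input (assumes a float32 input).
        data = np.random.random_sample(tuple(inp["shape"])).astype(np.float32)
        interpreter.set_tensor(inp["index"], data)
        interpreter.invoke()  # warm-up pass
        start = time.perf_counter()
        for _ in range(runs):
            interpreter.invoke()
        return (time.perf_counter() - start) / runs * 1000.0  # ms per inference

    print("FP32 weights:", time_model("mobilenet_v2_fp32.tflite"), "ms")
    print("FP16 weights:", time_model("mobilenet_v2_fp16.tflite"), "ms")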
The new functionality isn't universal to all Arm targets, though. "Currently, the FP16-capable hardware supported in XNNPack is limited to ARM & ARM64 devices with ARMv8.2 FP16 arithmetics extension, which includes Android phones starting with Pixel 3, Galaxy S9 (Snapdragon SoC [System-on-Chip]), Galaxy S10 (Exynos SoC), iOS devices with A11 or newer SoCs, all Apple Silicon Macs, and Windows ARM64 laptops based on the Snapdragon 850 SoC or newer," Dukhan and Barchard admit.
"To benefit from the half-precision inference in XNNPack, the user must provide a floating-point (FP32) model with FP16 weights and special 'reduced_precision_support' metadata to indicate model compatibility with FP16 inference. When the compatible model is delegated to XNNPack on a hardware with native support for FP16 computations, XNNPack will transparently replace FP32 operators with their FP16 equivalents, and insert additional operators to convert model inputs from FP32 to FP16 and convert model outputs back from FP16 to FP32.
"If the hardware is not capable of FP16 arithmetics, XNNPack will perform model inference with FP32 calculations. Therefore, a single model can be transparently deployed on both recent and legacy devices."
The latest version of TensorFlow Lite is available on GitHub under the permissive Apache 2.0 license; Dukhan and Barchard have confirmed that the team is now looking to expand FP16 support to compatible x86 devices "in a future release."