Benchmarking TensorFlow and TensorFlow Lite on Raspberry Pi 5

Using TensorFlow Lite models on the Raspberry Pi 5 now offers similar inferencing performance to a Coral TPU accelerator.

All the way back in 2019 I spent a lot of time looking at machine learning on the edge. Over the course of about six months I published more than a dozen articles on benchmarking the then new generation of machine learning accelerator hardware that was only just starting to appear on the market, and gave a series of talks around the findings.

A lot has changed in the intervening years, but after getting a recent nudge I returned to my benchmark code and — after fixing some of the inevitable bit rot — I ran it on the new Raspberry Pi 5.

Headline results from benchmarking

Running the benchmarks on the new Raspberry Pi 5 we see significant improvements in inferencing speed, with full TensorFlow models running almost ×5 faster than they did on the Raspberry Pi 4. We see a similar increase in inferencing speed when using TensorFlow Lite, with models again running almost ×5 faster than on the Raspberry Pi 4.

However, perhaps the more impressive result is that, while inferencing on Coral accelerator hardware is still faster than using full TensorFlow models on the Raspberry Pi 5, the new Raspberry Pi 5 running TensorFlow Lite now performs on a par with the Coral TPU, displaying essentially the same inferencing speeds.

ℹ️ Information As per our previous results with the Raspberry Pi 4, we used active cooling with the Raspberry Pi 5 to keep the CPU temperature stable and prevent thermal throttling of the CPU during inferencing.

The conclusion is that custom accelerator hardware may no longer be needed for some inferencing tasks at the edge, as inferencing directly on the Raspberry Pi 5 CPU — with no GPU acceleration — is now on a par with the performance of the Coral TPU.

ℹ️ Information The Coral hardware makes use of quantization the same way TensorFlow Lite does to reduce the size of models. However to use a TensorFlow Lite model with Edge TPU hardware there are a few extra steps involved. First you need to convert your TensorFlow model to the optimized FlatBuffer format to represent graphs used by TensorFlow Lite. But additionally you also need to compile your TensorFlow Lite model for compatibility with the Edge TPU using Google’s compiler.
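
As a rough sketch of what those steps look like in practice, converting and quantizing a model might go something like this (the paths and the representative dataset below are illustrative placeholders, not the actual benchmark models):

import numpy as np
import tensorflow as tf

# Convert a TensorFlow SavedModel to the TensorFlow Lite FlatBuffer format
# with full integer (post-training) quantization. The paths and the
# representative dataset here are placeholders.
def representative_dataset():
    for _ in range(100):
        # Yield sample inputs with the shape the model expects
        yield [np.random.rand(1, 300, 300, 3).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_saved_model("saved_model_dir")
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.uint8
converter.inference_output_type = tf.uint8

with open("model_quant.tflite", "wb") as f:
    f.write(converter.convert())

The resulting model_quant.tflite file then needs to be passed through Google's edgetpu_compiler before it will run on the Edge TPU itself.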

Conclusion

Inferencing speeds with TensorFlow and TensorFlow Lite on the Raspberry Pi 5 are significantly improved over the Raspberry Pi 4. Additionally, the Raspberry Pi 5 now offers similar performance to the Coral TPU.

Part I — Benchmarking

A more in-depth analysis of the results

In our original benchmarks we saw that the two dedicated boards, the Coral Dev Board from Google and the Jetson Nano Developer Kit from NVIDIA, were the best performing of our surveyed platforms. Of these two boards the Coral Dev Board ran significantly faster, with inferencing times around ×4 shorter than the Jetson Nano for the same machine learning model.

However, at the time the benchmarking results made me wonder whether we had gone ahead and started to optimize things in hardware just a little too soon.

The significantly faster inferencing times we saw then from models making use of quantization, and the dominance of the Coral platform, which also relied on quantization to increase its performance, suggested that we should still be exploring software strategies before continuing to optimize accelerator hardware any further.

These results from benchmarking on the Raspberry Pi 5 seem to bear my original doubts out. It has taken four years for general-purpose CPUs to catch up with what was then the best-in-class accelerator silicon. While a new generation of accelerator hardware is now available which will be more performant — and yes, I'll be looking at that when I can — the Coral TPU is still seen by many as "best in class" and is in widespread use despite a lack of support from Google for their accelerator platform.

The Raspberry Pi 5 is now performant enough to keep up with inferencing on real-time video and performs on a par with the Coral TPU. The results imply that, for many use cases, Coral hardware could be replaced by a Raspberry Pi 5 for a significant cost saving without any performance degradation.

Summary

Due to the lack of support from Google for the pycoral library — updates seem to have stopped in 2021, and the library no longer works with modern Python distributions — along with the difficulty of getting Coral TPU hardware to work with modern operating systems, the significant reduction in inferencing times we see on the new Raspberry Pi 5 is very welcome.

Part II — Methodology

About the benchmarking code

Benchmarking was done using TensorFlow or, for the hardware-accelerated platforms that do not support TensorFlow, their native framework, with the same models used on the other platforms converted to the appropriate native format.

For the Coral EdgeTPU-based hardware we used TensorFlow Lite, and for Intel’s Movidius-based hardware we used their OpenVINO toolkit. Benchmarks were carried out twice on the NVIDIA Jetson Nano, first using vanilla TensorFlow models, and a second time using those models after optimization using NVIDIA's TensorFlow with TensorRT library.
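
For reference, on current TensorFlow 2.x releases that optimization step looks roughly like the sketch below; the directory names are placeholders, and the original 2019 benchmarks used the older TensorFlow 1.x API rather than this one.

from tensorflow.python.compiler.tensorrt import trt_convert as trt

# Optimize a SavedModel with TensorRT; directory names are placeholders.
converter = trt.TrtGraphConverterV2(input_saved_model_dir="saved_model_dir")
converter.convert()
converter.save("trt_saved_model_dir")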

Inferencing was carried out with the MobileNet v2 SSD and MobileNet v1 0.75 depth SSD models, both trained on the Common Objects in Context (COCO) dataset. A 3888×2916 pixel test image was used containing two recognizable objects in the frame, a banana 🍌 and an apple 🍎. The image was resized down to 300×300 pixels before being presented to the model, and each model was run 10,000 times before an average inferencing time was taken.

ℹ️ Information The first inferencing run, which can take up to ten times longer due to loading overheads, is discarded from the calculation of the average inferencing time.
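
In outline, the measurement loop looks something like the sketch below; the run_inference() call stands in for the model invocation and is not part of the actual benchmark code.

import time
import numpy as np
from PIL import Image

# Resize the test image, run the model repeatedly, discard the first
# (warm-up) run, and report the average of the remaining timings.
image = Image.open("fruit.jpg").resize((300, 300))
input_tensor = np.expand_dims(np.asarray(image), axis=0)

timings = []
for i in range(10000):
    start = time.monotonic()
    run_inference(input_tensor)  # placeholder for the model invocation
    if i > 0:                    # first run includes loading overheads
        timings.append(time.monotonic() - start)

print("Average inferencing time: %.1f ms" % (1000 * sum(timings) / len(timings)))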

While in the intervening years other benchmark frameworks have emerged that are arguably more rigorous, the benchmarks presented here are intended to reflect real-world performance. A number of the newer benchmarks measure the time to complete only the inferencing stage. While that's a much cleaner (and shorter) operation than the timings measured here — which include set-up time — most people aren't really interested in just the time between passing a tensor to the model and getting a result. Instead they want end-to-end timings.

One of the things that these benchmarks don't do is optimization. They take an image, pass it to a model, and measure the result. The code is simple, and what it measures is comparable to the performance an average developer doing the same task might get, rather than an experienced machine learning researcher who understands the complexities and limitations of the models, and how to adapt them to individual platforms and situations.

Setting up your Raspberry Pi

Go ahead and download the latest release of Raspberry Pi OS and set up your Raspberry Pi. Unless you’re using wired networking, or have a display and keyboard attached to the Raspberry Pi, at a minimum you’ll need to put the Raspberry Pi on to your wireless network, and enable SSH.

Once you’ve set up your Raspberry Pi go ahead and power it on, and then open up a Terminal window on your laptop and SSH into the Raspberry Pi.

ssh pi@raspberrypi.local

Once you’ve logged in you can install TensorFlow and TensorFlow Lite.

⚠️ Warning Starting in Raspberry Pi OS Bookworm, packages installed via pip must be installed into a Python virtual environment. A virtual environment is a container where you can safely install third-party modules so they won't interfere with your system Python.

Installing TensorFlow on Raspberry Pi 5

Installing TensorFlow on the Raspberry Pi is a lot more complicated than it used to be, as there is no longer an official package available. Fortunately, however, there is still an unofficial distribution, which at least means we don't have to resort to building and installing from source.

sudo apt install -y libhdf5-dev unzip pkg-config python3-pip cmake make git python-is-python3 wget patchelf
python -m venv --system-site-packages ~/.python-tf
source ~/.python-tf/bin/activate
pip install numpy==1.26.2
pip install keras_applications==1.0.8 --no-deps
pip install keras_preprocessing==1.1.2 --no-deps
pip install h5py==3.10.0
pip install pybind11==2.9.2
pip install packaging
pip install protobuf==3.20.3
pip install six wheel mock gdown
pip install opencv-python
TFVER=2.15.0.post1
PYVER=311
ARCH=`python -c 'import platform; print(platform.machine())'`
pip install --no-cache-dir https://github.com/PINTO0309/Tensorflow-bin/releases/download/v${TFVER}/tensorflow-${TFVER}-cp${PYVER}-none-linux_${ARCH}.whl
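
If the wheel has installed correctly you should be able to check the reported version from inside the virtual environment:

python -c 'import tensorflow as tf; print(tf.__version__)'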

Installing TensorFlow Lite on Raspberry Pi 5

There is still an official TensorFlow Lite runtime package available for the Raspberry Pi, so installation is much simpler than for full TensorFlow, where that option is no longer available.

python -m venv --system-site-packages ~/.python-tflite
source ~/.python-tflite/bin/activate
pip install opencv-python
pip install tflite-runtime
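
As a quick sanity check you can confirm that the runtime imports cleanly from inside the virtual environment:

python -c 'from tflite_runtime.interpreter import Interpreter; print("tflite-runtime OK")'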

Running the benchmarks

The benchmark_tf.py script is used to run TensorFlow benchmarks on Linux (including Raspberry Pi) and macOS. This script can also be used — with a TensorFlow installation which includes GPU support — on NVIDIA Jetson hardware.

source ~/.python-tf/bin/activate
./benchmark_tf.py --model PATH_TO_MODEL_FILE --label PATH_TO_LABEL_FILE --input INPUT_IMAGE --output LABELLED_OUTPUT_IMAGE --runs 10000

For example, benchmarking on a Raspberry Pi with the MobileNet v2 model for 10,000 inference runs, the invocation would be:

./benchmark_tf.py --model ssd_mobilenet_v2/tf_for_linux_and_macos/frozen_inference_graph.pb --label ssd_mobilenet_v2/tf_for_linux_and_macos/coco_labels.txt --input fruit.jpg --output output.jpg --runs 10000

This will output an output.jpg image with the two objects (the banana and the apple) labelled.

The benchmark_tf_lite.py script is used to run TensorFlow Lite benchmarks on Linux (including Raspberry Pi) and macOS.

source ~/.python-tflite/bin/activate
./benchmark_tf_lite.py --model PATH_TO_MODEL_FILE --label PATH_TO_LABEL_FILE --input INPUT_IMAGE --output LABELLED_OUTPUT_IMAGE --runs 10000

⚠️ Warning Models passed to TensorFlow Lite must be quantized; to do this the model must be converted to the TensorFlow Lite FlatBuffer format with quantization enabled.
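
For context, loading and invoking a quantized TensorFlow Lite model with the tflite_runtime Interpreter looks roughly like the sketch below; the model path is a placeholder and parsing of the detection outputs is omitted.

import numpy as np
from PIL import Image
from tflite_runtime.interpreter import Interpreter

# Load a quantized TensorFlow Lite model and run a single inference.
interpreter = Interpreter(model_path="detect.tflite")  # placeholder path
interpreter.allocate_tensors()

input_details = interpreter.get_input_details()
output_details = interpreter.get_output_details()

# Resize the test image to the input size the model expects.
_, height, width, _ = input_details[0]["shape"]
image = Image.open("fruit.jpg").resize((int(width), int(height)))
input_tensor = np.expand_dims(np.asarray(image, dtype=np.uint8), axis=0)

interpreter.set_tensor(input_details[0]["index"], input_tensor)
interpreter.invoke()

# For an SSD detection model the outputs include boxes, classes, and scores
# (the exact ordering varies between models).
boxes = interpreter.get_tensor(output_details[0]["index"])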

Getting the benchmark code

The benchmark code is now available on GitHub. The repository includes all the resources needed to reproduce the benchmarking results, including models, code for all the tested platforms, and the test imagery used. There is also an ongoing discussion about how to improve the benchmark to make it more easily run on new hardware.

Alasdair Allan
Scientist, author, hacker, maker, and journalist. Building, breaking, and writing. For hire. You can reach me at 📫 alasdair@babilim.co.uk.