This series covers the optimization steps for running fast, low-power ML on the MaaXBoard OSM93.
To demonstrate the optimization process, we'll be building an actual pose detection application using MediaPipe's models.
The image below compares the MaaXBoard OSM93 with three other state-of-the-art Edge AI boards: the i.MX8M+ Edge AI Kit, the QCS6490 Vision-AI Kit, and the NVIDIA Jetson Orin Nano.
While many newer edge AI boards on the market offer double-digit TOPs, the NPU on the MaaXBoard OSM93 is only 0.5 TOPs. However, with just a bit of optimization, this board is capable of fast inference for machine vision projects ranging from image classification to image segmentation.
If you buy a board with more compute power than you need, not only will you pay more for it up front; you'll also pay more for the lifetime of the board, as more TOPs generally means more power consumption.
In this project, we will cover:
- The Project: Pose Detection
- Quantization: what, why, how?
- Post-training quantization of Tensorflow Lite models for MaaXBoard OSM93
- Benchmarking various types of post-training quantization on MaaXBoard OSM93
Pose detection can be used for many different applications, from loss prevention to fall detection. In this series, we'll be looking at using pose detection for a fitness application.
After reviewing a number of pose models, I finally settled on the MediaPipe models. MediaPipe fit my requirements for a lightweight model with good accuracy when detecting a single person in streaming video.
MediaPipe actually consists of two models:
- Pose detection: finds a bounding box determining where a person is on screen
- Landmark detection: Takes the region of interest (ROI) from the previously detected bounding box and finds points, or landmarks, such as eyes, nose, arms, legs, and torso.
The MediaPipe framework can be installed with one command:
pip install mediapipe
Unfortunately, the framework is bulky and can't be accelerated on the NPU. On the MaaXBoard OSM93 it runs at a laggy 2 frames per second!
Thankfully, Mario Bergeron created a custom Python framework. His project here dives deeper into these models and how to run them using his custom framework: https://www.hackster.io/AlbertaBeef/blazing-fast-models-972750
QUANTIZATION: WHAT, WHY, HOW?
What is quantization?
Model quantization transforms a machine learning model's 32-bit floating point weights (and sometimes other parameters) to 8 or 16-bit weights.
The quantized model is smaller - quantization reduces the size in bytes - so it takes up less memory. Quantization also speeds up inference time, because there are fewer numbers to process. An added benefit of fewer calculations is lower power consumption.
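To make that concrete, here is a minimal sketch (in Python, with made-up example values) of the affine int8 mapping that Tensorflow Lite uses, where real_value ≈ scale * (quantized_value - zero_point):
import numpy as np

def quantize_int8(x, x_min, x_max):
    scale = (x_max - x_min) / 255.0                # spread the float range over the 256 int8 codes
    zero_point = int(round(-128 - x_min / scale))  # the int8 code that maps back to 0.0
    q = np.clip(np.round(x / scale) + zero_point, -128, 127).astype(np.int8)
    return q, scale, zero_point

weights = np.array([-0.8, 0.0, 0.25, 0.79], dtype=np.float32)
q, scale, zp = quantize_int8(weights, weights.min(), weights.max())
print(q, scale, zp)  # 4 bytes of weights instead of 16, plus two shared parameters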
Why is quantization necessary?
To run accelerated machine learning on the Ethos-U65 NPU on the MaaXBoard OSM93, you must first quantize your models. This NPU ONLY supports INT8, UINT8 or UINT16 operations.
The Vela compiler, which further optimizes the model for performance and power, requires that the models be quantized to UINT8 or INT8 before conversion (the Vela compiler will be covered in Part 2 of this series).
By quantizing, you can 5 to 10X your performance!
Types of Quantization
We will be doing post-training quantization in this project. Post-training quantization, like it sounds, is done after the model has been trained.
Quantization-aware training is done during training.
As you might have guessed, converting high-precision floating-point numbers to lower-precision integers can cause accuracy loss. Quantization-aware training minimizes the accuracy loss - however, in my experience, even post-training quantization results in negligible accuracy loss.
The graph below compares accuracy numbers for quantization-aware training and post-training quantization of three Tensorflow models:
We'll be quantizing a Tensorflow model because the MaaXBoard OSM93 NPU supports the Tensorflow Lite framework for inference.
Getting Started
The first part of this project will be done entirely on your host computer. You can use Linux, Mac, or Windows, but the NXP eIQ Toolkit is only available for Linux or Windows.
- Download the MediaPipe Models here: (these are the latest v0.10 models) Pose detection model, Pose landmark model (full)
- Download and install the NXP eIQ Toolkit, OR
- Download and install Tensorflow
There are two relatively easy ways to do post-training quantization on a Tensorflow Lite model:
1.) Use the Tensorflow Lite Converter
Using Tensorflow's quantization tools provides the most flexibility.
2.) Use the eIQ Toolkit GUI
The NXP eIQ Toolkit provides a graphical user interface to easily quantize models.
Quantization Steps
The following steps are required for both methods:
1. Convert the model to SavedModel format
The Tensorflow model must be in SavedModel format to be quantized. MediaPipe only provides models in Tensorflow Lite format, so we'll have to convert them.
Note: I've provided both detection and landmark models in saved_model format within my repo, so if you prefer you can skip this step.
To convert the .tflite model to Tensorflow SavedModel format, we'll use the scripts provided here: https://github.com/PINTO0309/tflite2tensorflow
The easiest way to use this script is in the docker container, although there are also instructions to install it with pip and run it on your PC. Make sure you have Docker installed and pull the Docker environment:
docker pull ghcr.io/pinto0309/tflite2tensorflow:latest
Run the docker environment:
sudo docker run -it --rm \
-v `pwd`:/home/user/workdir \
ghcr.io/pinto0309/tflite2tensorflow:latest
Run the script to output the saved_model file (do this for both pose_detection and pose_landmark models):
tflite2tensorflow \
--model_path pose_detection.tflite \
--flatc_path ../flatc \
--schema_path ../schema.fbs \
--output_pb
If this succeeds, you should now see a folder with the saved_model file:
The model is now ready to be quantized.
2. Gather a representative dataset
Full integer quantization using eIQ or the Tensorflow Lite tools automatically runs a few inference cycles to calculate the range (min, max) of variable tensors (i.e. input, activations, and output). As a result, the converter requires a representative dataset to calibrate them.
To create a representative dataset, you want to use training samples that your model performs well on. I only used 10 samples, but for better accuracy you might want to use as many as 500. The size of the representative dataset and the specific samples can be treated as hyperparameters that you tune for accuracy.
I chose a mix of samples from the CocoPose dataset and from Human Pose Estimation on Kaggle.
The samples I chose are in the repo under the "representative dataset" directory.
The model takes an input of 224x224 pixels, so I've resized the images to match that format.
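If you want to prepare your own samples, resizing them is just a few lines (the folder names below are placeholders, not paths from my repo):
from pathlib import Path
from PIL import Image

src_dir = Path("raw_samples")             # placeholder: wherever your downloaded samples live
dst_dir = Path("representative_dataset")
dst_dir.mkdir(exist_ok=True)

for img_path in sorted(src_dir.glob("*.jpg")):
    img = Image.open(img_path).convert("RGB")
    img.resize((224, 224)).save(dst_dir / img_path.name)  # match the model's 224x224 input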
3.A Quantize with the Tensorflow Lite Converter
(If you prefer to use the GUI-based method, skip ahead to section 3.B)
Tensorflow allows you to do various different types of quantization, including Dynamic Range, 16-bit, Full Integer, and Integer Only.
In the table below, I've benchmarked all of these types of quantization on the MaaXBoard OSM93 so you don't have to:
The takeaway is that for this board, there are only two types of quantization that you should consider:
- Full Integer quantization: quantizes all tensors EXCEPT input and output to 8-bit.
- Integer Only quantization: quantizes all tensors INCLUDING input and output to 8-bit
Remember that the NPU only supports quantized operations, so any input and output calculations for a Full Integer quantized model will be run on the CPU.
To run quantization, use the quantization.py script.
python3 quantization.py
You should end up with a file named "model_quantized_int8.tflite"
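The quantization.py script in the repo may differ in its details, but a minimal sketch of full integer quantization with the Tensorflow Lite Converter looks roughly like this (the SavedModel path, image folder, and input normalization are assumptions you'll need to adapt):
import numpy as np
import tensorflow as tf
from pathlib import Path
from PIL import Image

SAVED_MODEL_DIR = "pose_detection/saved_model"  # assumed path to the converted SavedModel
REP_DATA_DIR = Path("representative_dataset")   # folder of 224x224 calibration images
INTEGER_ONLY = True                             # False = Full Integer (float input/output handled on the CPU)

def representative_dataset():
    for img_path in sorted(REP_DATA_DIR.glob("*.jpg")):
        img = Image.open(img_path).convert("RGB").resize((224, 224))
        data = np.asarray(img, dtype=np.float32) / 255.0  # assumed [0, 1] scaling; match your model's preprocessing
        yield [np.expand_dims(data, axis=0)]

converter = tf.lite.TFLiteConverter.from_saved_model(SAVED_MODEL_DIR)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset

if INTEGER_ONLY:
    # Integer Only: quantize every tensor, including the input and output
    converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
    converter.inference_input_type = tf.int8
    converter.inference_output_type = tf.int8

with open("model_quantized_int8.tflite", "wb") as f:
    f.write(converter.convert())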
3.B Quantize with the eIQ Toolkit GUI
- Select "Model Tool"
- Select "Open Model" and select the detection model saved_model.pb file
- Select the hamburger menu in the top left corner
- Select "Convert"
- Select "Tensorflow Lite (.tflite) (eiq-converter-tflite)"
- Enter the model name (this gets embedded in the RTM; it doesn't become the file name)
- Check the "Enable Quantization" box. More settings will pop up.
- Leave "per channel" as the quantization type. "Maintain Existing Data Type" for the Input and Output types.
- You can select the Quantization Normalization. I left it signed.
- Select "Convert" and select the name you would like the output tflite file to have.
You should now have a .tflite file that has been quantized. You can check it out in the eIQ Model Viewer by selecting "Open" from the hamburger menu.
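Whichever method you used, you can also sanity-check the result with the Tensorflow Lite interpreter in Python (the file name below is a placeholder for your own output):
import tensorflow as tf

interpreter = tf.lite.Interpreter(model_path="pose_detection_quant.tflite")  # placeholder file name
interpreter.allocate_tensors()
for detail in interpreter.get_input_details() + interpreter.get_output_details():
    # For an Integer Only model, every dtype here should be int8 (or uint8)
    print(detail["name"], detail["dtype"], detail["quantization"])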
4. Verify Model Accuracy
Verifying the model accuracy requires downloading and running the MediaPipe framework. To keep this project from getting too long, I've created a separate project on how to do this here.
HOW TO BENCHMARK
It's super easy to benchmark models using the Tensorflow Lite benchmark tool on your MaaXBoard OSM93.
Boot up your MaaXBoard OSM93 (check out this project for getting started details) and let's see how our quantized models compare to the original models.
The benchmark script is located under "/usr/bin/tensorflow-lite-[VERSION]/examples" e.g. /usr/bin/tensorflow-lite-2.10.0/examples
- First, let's benchmark the original pose_detection.tflite model on the CPU. The i.MX93 CPU has two cores so we specify two threads:
./benchmark_model --graph=pose_detection.tflite \
--num_threads=2
- Next, let's benchmark the pose_detection_full_quant.tflite on the ARM Ethos-U65 NPU. We'll specify the external delegate path to make sure inference runs on the NPU:
./benchmark_model --graph=pose_detection_full_quant.tflite \
--external_delegate_path=/usr/lib/libethosu_delegate.so
NEXT STEPS
Now that we've covered quantization, we'll be looking at Vela conversion next. Check it out in Part 2!