This series covers optimizing machine learning models to run on the MaaXBoard OSM93. This board comes with a neural processing unit (NPU) that's capable of running low-power, fast machine learning. To squeeze as much performance as possible out of the NPU, some optimization is required.
In Part 1 of this series, we looked at quantization. Today we'll look at Vela conversion.
What is Vela?
Vela is a tool that is used to compile a TensorFlow Lite for Microcontrollers neural network model into an optimized version that can run on an embedded system containing an Arm Ethos-U NPU. The original Arm Vela compiler has been adapted by NXP for their boards that use the Arm Ethos-U NPU.
After compilation, the model is still in TensorFlow Lite format but can ONLY run on supported hardware.
What does Vela actually do?
Vela performs graph optimization on the model. This includes things like fusing operations, pruning, batching and reordering, and memory optimizations like sharing and reuse.
How does Vela compare to other optimization methods?
There are many methods for optimizing machine learning models to run on machine learning accelerators.
Some processors, like NXP's i.MX8M+, don't require any model compilation; they can run TensorFlow models directly on the NPU. This makes things easier by removing the compilation step, but it comes with the tradeoff of higher power consumption and lower throughput than a fully optimized model. Other methods of optimization, such as TVM, can get deeply technical.
Because Vela is so specific - only targeting the Arm Ethos-U NPU - it's possible to optimize the model with great results without much effort.
Limitations
NXP's Vela tool doesn't support all of the operators that TensorFlow Lite supports.
Supported operators:
Supported operators for the latest version of Vela can be found here, along with specific constraints for each operator.
Prerequisites
To convert a model using Vela, it must be quantized to UINT8 or INT8 format (check out the project on how to do that here).
Methods
There are three methods for conversion:
- Compile on board
- Use the eIQ Toolkit
- Use the command line tool
1. Compile on board
This is the easiest way. The Vela tool comes preloaded with the board's Linux image. Set up your board as detailed in the project "Getting Started With Machine Learning on MaaXBoard OSM93."
Move your quantized model to the MaaXBoard OSM93. Now just run one command, e.g.:
vela pose_detection_int_only_quant.tflite
Your terminal will print a summary of the converted network, as well as the estimated inference time. To learn more about how it gets these performance numbers, see Vela Performance Estimation Summary. You can also use the --verbose-performance option to print per-layer performance stats.
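For example, to get the per-layer breakdown for the same model, you could optionally run:
vela pose_detection_int_only_quant.tflite --verbose-performance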
With that single vela command above, you are done!
The Vela-compiled model will be in a folder named "output" and will be named pose_detection_int_only_quant_vela.tflite. There will also be a CSV file containing the same details about the converted model that were printed in the terminal summary.
Note: add a swap file to the board if you're converting models larger than a couple of GB, because conversion is memory intensive.
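If you need one, a minimal swap file setup on the board might look like the following (run as root; the 2 GB size is only an example, and dd can stand in if fallocate isn't available):
fallocate -l 2G /swapfile
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile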
2. Use the eIQ Toolkit
Another easy way to convert to Vela on your host PC is by using the NXP eIQ Toolkit. This tool runs on both Windows and Linux. Installation is simple.
Once the tool is installed, open it.
- Select "Model Tool"
- Select "Open Model" and select your quantized model.
- Open the hamburger menu and select the "Convert" option.
- Under conversion options, select Tensorflow Lite Vela/i.MX93 (.tflite) (eiq-converter-armvela)
- Click "Convert".
Your converted TensorFlow Lite model will show up in the folder, along with a CSV file showing a summary of the converted model.
3. Use the command line tool
The command line tool offers the most options. The Vela converter is open source, so changes can even be made to the source code if that level of control is desired.
INSTALLATION
The Vela command line tool runs on Linux and Windows 10. Check the ethos-u vela repository on github for the latest install instructions.
Currently, Vela depends on the following versions of TensorFlow and Python:
- Vela 3.12.0 to current supports TensorFlow 2.16
- Vela 3.10.0 to current supports Python 3.10 (3.9)
Install the development version of Python 3.10 containing the Python/C API header files, e.g. apt install python3.10-dev
or yum install python310-devel
Additionally, install:
- pip3
- C99 capable compiler and associated toolchain. For Linux operating systems, a GNU toolchain is recommended (see the example after this list). For Microsoft Windows 10, the Microsoft Visual C++ 14.2 Build Tools are recommended. See https://wiki.python.org/moin/WindowsCompilers
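On a Debian/Ubuntu host, for example, the GNU toolchain can usually be installed with the command below (an assumption about your distribution; use your distro's equivalent otherwise):
sudo apt install build-essential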
Install Vela from PyPi using the following command:
pip3 install ethos-u-vela
Alternatively, clone the Git repository and run pip install:
git clone https://review.mlplatform.org/ml/ethos-u/ethos-u-vela.git
cd ethos-u-vela
pip3 install .
OR if you want to modify the source code, run:
pip3 install -e .[dev]
The -e flag installs the package in editable mode so you don't have to reinstall after every modification, and [dev] pulls in the development dependencies.
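To confirm the installation worked, you can check the installed version, and recent releases can also dump the list of supported operators to a Markdown file:
vela --version
vela --supported-ops-report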
CONVERSION
Similar to compiling on the board, with the command line Vela tool you can simply run this command:
vela pose_detection_full_quant.tflite
After conversion, network details, including total MACs (instead of the estimated inference time), are printed to the terminal:
The optimized version of the TensorFlow Lite model will be output to ./output/<model name>_vela.tflite (here, pose_detection_full_quant_vela.tflite), along with the CSV file.
GOING FURTHER
There are many different options that can be selected when compiling with Vela, such as various memory configurations, as well as trade-offs between performance and peak SRAM usage. The ethos-u-vela repo contains instructions for selecting additional options when compiling via the command line.
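As a rough sketch (the flag names come from the upstream Vela CLI, and the exact accelerator and memory configuration for the i.MX93 may differ in NXP's build), a command that targets the Ethos-U65 and trades some performance for lower peak SRAM usage might look like:
vela pose_detection_int_only_quant.tflite --accelerator-config ethos-u65-256 --optimise Size --output-dir ./output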
Let me know in the comments if you would like me to cover these in more detail in a future project.
CONVERTED MODELS
After conversion, if you open the model in eIQ Model Tool (or Netron) you'll notice that most of the layers have been condensed into a single layer named "ethos-u." These layers are denoted in dark gray, while the CPU operators are shown in black.
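If you don't have the eIQ Toolkit installed, Netron can also be installed and launched from the command line on a host PC with Python (just one convenient option; any Netron install works):
pip3 install netron
netron pose_detection_int_only_quant_vela.tflite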
In the pose detection model, the DepthToSpace operators are not converted because they aren't supported yet by the Vela compiler (DepthToSpace was first supported in TensorFlow 2.16.1, so it's likely it will be supported in the next version of the compiler).
If you converted both the full integer quantized and integer only quantized models, you'll see a difference in how many operators are placed on the CPU.
For the Full Integer Quantized model, the inputs must first run through a quantize operator, and the outputs must be dequantized. Quantize and Dequantize aren't supported by the Ethos-U NPU (remember, the NPU only supports INT8, UINT8, and UINT16 operations).
The Integer Only Quantized model is able to put more operations on the NPU, and will likely run faster.
So what are the final performance results after converting the model to Vela? We already got a sneak peek at what the performance could be after conversion.
First, you may notice a difference in model size between the quantized model we started with and the output Vela model. Here's the difference for the integer-only quantized pose detection model (you can verify the sizes on the board, as shown after the list):
- Original quantized pose_detection model size: 3.6MB
- Vela converted pose_detection model size: 1.8MB
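To double-check these numbers yourself, a quick ls on the board works (file names assumed to match the earlier on-board conversion step):
ls -lh pose_detection_int_only_quant.tflite output/pose_detection_int_only_quant_vela.tflite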
Let's see if Vela's performance estimate is close to what the model actually achieves by using the benchmark tool in "/usr/bin/tensorflow-lite-[VERSION]/examples." Don't forget to include the external delegate path in the command:
/usr/bin/tensorflow-lite-2.10.0/examples/benchmark_model --graph=pose_detection_int_only_quant_vela.tflite --external_delegate_path=/usr/lib/libethosu_delegate.so
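For comparison, the quantized (CPU) baseline below can be measured by running the same benchmark on the original model without the external delegate (an assumption about how that number was gathered):
/usr/bin/tensorflow-lite-2.10.0/examples/benchmark_model --graph=pose_detection_int_only_quant.tflite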
For the pose detection model, here are the stats I benchmarked:
- Quantized model performance: 6.86ms
- Estimated Vela performance: 8.58ms
- Benchmarked Vela performance: 6.78ms
Vela's performance estimate turns out to be more pessimistic than the actual benchmarked performance.
It's also interesting to note that the Vela-converted model isn't much faster than the quantized model without conversion. This is likely due to the DepthToSpace operators falling back to the CPU.
The Landmark model had more significant performance gains when converted to Vela. This is likely because all operations on this model are able to run on the NPU.
I'd love to hear about your Vela conversion results. Thanks for reading!