This blog extends my TensorFlow Lite for Microcontrollers tutorial. I read through the TinyML book, whose 15th chapter details how and why to reduce latency in a TinyML project. I found it very useful and wanted to turn it into a practical guide so that others working on TinyML projects can get started with latency optimization faster.
This blog in a nutshell:
1. Why optimize latency?
2. Is optimization really needed here?
3. Hardware changes to reduce latency
4. Model architecture
5. Quantization
6. Product design
7. Performance profiling
8. Taking advantage of hardware features
9. Conclusion
1. Why optimize latency?
Embedded systems don’t have much computing power, which means that the intensive calculations needed for neural networks can take longer than on most other platforms, and running slowly can cause a lot of problems.
Two examples to reinforce the point:
- Consider trying to capture a moment in time that might only happen once, such as a bird flying into a camera's field of vision. You might sample the sensor too slowly and miss one of these instances if your processing time is too long.
- The quality of a prediction is sometimes improved by repeated observations of overlapping windows of sensor data. The wake-word detection example runs inference on a one-second window of audio, but moves the window forward only a hundred milliseconds or less each time and averages the results (a rough sketch of this averaging appears at the end of this section). It is therefore important to process the data quickly so that multiple overlapping inferences can be run.
In these situations, reducing latency enables us to improve overall accuracy. A device may be able to consume less energy overall by functioning at a lower CPU frequency or sleeping between inferences if the model execution is sped up.
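To make the averaging idea from the wake-word example concrete, here is a rough sketch of smoothing scores across overlapping windows; the window count, function name, and call rate are illustrative rather than taken from the actual example code.

// Keep the last few per-window wake-word scores and report their average.
#include <cstddef>

constexpr size_t kAveragingWindow = 10;  // number of recent scores to smooth over
float recent_scores[kAveragingWindow] = {0.0f};
size_t next_slot = 0;

// Call once per inference (e.g., each time the window advances by ~100 ms)
// with the model's latest score; returns the smoothed score.
float UpdateAveragedScore(float new_score) {
  recent_scores[next_slot] = new_score;
  next_slot = (next_slot + 1) % kAveragingWindow;
  float sum = 0.0f;
  for (size_t i = 0; i < kAveragingWindow; ++i) {
    sum += recent_scores[i];
  }
  return sum / kAveragingWindow;
}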
2. Is optimization really needed here?
It's likely that the neural network code makes up such a small portion of your system's latency that speeding it up wouldn't significantly affect the performance of the overall application.
The simplest way to determine whether this is the case is by commenting out the call to invoke the interpreter in your application code.
In practice, comment out the line below in your code.
// tflite::MicroInterpreter::Invoke()
This is the function that contains all of the inference calculations, and it will block other operations until the network has finished inference, so by removing it you can observe what difference it makes to the overall latency.
If the difference between running the network inference and skipping it is minimal, there isn't much to gain from optimizing the deep learning portion of the code; you should pay attention to other aspects of your application first.
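For concreteness, here is a minimal sketch of what this experiment looks like in a typical TensorFlow Lite for Microcontrollers application. It assumes the interpreter has already been created and its tensors allocated, as in the library's example apps; FillInputTensor() and HandleOutput() are hypothetical placeholders for your own sensor-capture and result-handling code.

// Minimal sketch of the "comment out Invoke()" experiment. Assumes a
// tflite::MicroInterpreter named `interpreter` has already been set up
// elsewhere, as in the standard TFLM example applications.
#include "tensorflow/lite/micro/micro_interpreter.h"

extern tflite::MicroInterpreter* interpreter;

void FillInputTensor(TfLiteTensor* input) { /* placeholder: copy sensor data in */ }
void HandleOutput(const TfLiteTensor* output) { /* placeholder: act on the results */ }

void RunOnce() {
  FillInputTensor(interpreter->input(0));

  // Temporarily comment out the inference call. If the application's overall
  // latency barely changes without it, the network is not your bottleneck.
  // if (interpreter->Invoke() != kTfLiteOk) {
  //   // handle the error
  // }

  HandleOutput(interpreter->output(0));
}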
3. Hardware changes to reduce latency
The first thing to consider if you need to speed up your neural network code is whether you can employ a more powerful hardware device. Since this is the easiest factor to change from a software perspective, it's worth considering explicitly. When you do have a choice, energy, speed, and cost are typically your biggest constraints; if possible, pick a chip that reduces energy consumption or cost while increasing speed.
4. Model architecture
The next easiest place to have a big impact on neural network latency is at the architecture level. If you can create a new model that is accurate enough but involves fewer calculations, you can speed up inference without making any code changes at all.
Since it is typically possible to trade decreased accuracy for improved speed, starting with a model that is as accurate as possible gives you a lot more room for these trade-offs.
- This means that even for seemingly unrelated activities like latency optimization, investing time in refining and expanding your training data can pay off throughout the development process.
- You can also experiment with removing specific layers from the model you're using to see what happens. Neural networks typically degrade very gracefully, so feel free to try many different destructive adjustments and observe their effect on accuracy and latency.
The majority of a neural network model's computation time is spent on large matrix multiplications. Because each input value must be scaled by a separate weight for every output value, the work required for each layer is roughly the number of input values times the number of output values. This is commonly approximated by the number of floating-point operations (FLOPs) a network requires for a single inference run.
FLOPs are therefore useful as a rough indicator of how long a network will take to execute: a model with fewer calculations will generally run faster, roughly in proportion to the difference in FLOPs.
For example, you could reasonably expect a model that requires 100 million FLOPs to run twice as fast as a 200-million FLOP version. This isn’t entirely true in practice, but it’s a good starting point for evaluating different network architectures.
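As a back-of-the-envelope illustration of this kind of estimate, the sketch below counts the FLOPs of a single fully connected layer using the inputs-times-outputs reasoning above; the layer sizes are made up for the example, and real runtimes also depend on memory access patterns and kernel implementations.

// Rough FLOPs estimate for one fully connected layer. Each output value needs
// one multiply and one add per input value, so the cost is about
// 2 * inputs * outputs floating-point operations.
#include <cstdint>
#include <cstdio>

uint64_t DenseLayerFlops(uint64_t inputs, uint64_t outputs) {
  return 2ull * inputs * outputs;
}

int main() {
  // Illustrative sizes only: a 490-input, 128-output layer.
  const uint64_t flops = DenseLayerFlops(490, 128);
  printf("Approximate FLOPs per inference: %llu\n",
         static_cast<unsigned long long>(flops));
  return 0;
}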
5. Quantization
Deep learning models are able to cope with large losses in numerical precision during intermediate calculations and still produce end results that are accurate overall. This property seems to be a by-product of their training process: the inputs are large and full of noise, so the models learn to be robust to insignificant variations and to focus on the patterns that matter.
What this means in practice is that 32-bit floating-point representations are almost always more precise than inference requires. Training is a bit more demanding, because it relies on many small adjustments to the weights as the model learns.
Most inference applications can produce results that are indistinguishable from the floating-point equivalent while using just 8 bits to store weights and activation values, and running fully quantized models has big latency benefits on almost all platforms.
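For intuition, here is a minimal sketch of the affine 8-bit scheme that quantized TensorFlow Lite models use, where a real value is approximated as (quantized_value - zero_point) * scale. The scale and zero-point numbers below are invented for illustration; in a real model they are chosen during conversion.

// Toy example of 8-bit affine quantization: map a float to int8 and back.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>

int8_t QuantizeToInt8(float real_value, float scale, int zero_point) {
  const int q = static_cast<int>(std::round(real_value / scale)) + zero_point;
  return static_cast<int8_t>(std::min(127, std::max(-128, q)));  // clamp to int8 range
}

float DequantizeFromInt8(int8_t q, float scale, int zero_point) {
  return (static_cast<int>(q) - zero_point) * scale;
}

int main() {
  const float scale = 0.02f;   // illustrative values; normally derived from the
  const int zero_point = -5;   // tensor's observed range during conversion
  const int8_t q = QuantizeToInt8(0.37f, scale, zero_point);
  printf("quantized: %d, recovered: %.4f\n", q,
         DequantizeFromInt8(q, scale, zero_point));
  return 0;
}

With these example values the recovered number comes back as 0.38 rather than 0.37, which is exactly the kind of small error that networks tolerate well.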
6. Product design
While you might not see your product design as a technique to reduce latency, it is one of the best places to focus your attention. The trick is to determine whether you can relax the demands on your network, either for accuracy or for speed. Two examples are given below.
Example 1: If your body-pose detection model runs in under a second, you might be able to use a much faster optical tracking algorithm to follow the identified points at a higher rate, updating it with the more accurate but less frequent neural network results as they become available. This would allow you to track hand gestures with a camera at several frames per second.
Example 2: While wake-word detection remains active on the local device, a microcontroller could delegate enhanced speech recognition to a cloud API accessed over a network.
7. Performance profiling
The foundation of any code optimization effort is knowing how long different parts of your program take to run. This can be surprisingly difficult to figure out in the embedded world, because you might not even have a simple timer available by default, and even if you do, recording and returning the information you need can be demanding. A variety of approaches are described below.
7.a Blinky
At least one LED is present on almost all embedded development boards, and your software can usually control it. If the periods you're measuring are longer than about half a second, try turning the LED on at the beginning of the code section you want to measure and turning it off at the end. Using an external stopwatch and manually counting how many blinks you observe in 10 seconds, you can roughly estimate the time required. You can also run two development boards with different versions of the code side by side and compare their relative flash frequency to see which one is quicker.
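Here is a minimal sketch of the idea, assuming an Arduino-style board that exposes the standard LED_BUILTIN pin; RunInference() is a placeholder for whatever code section you want to measure.

// Toggle the built-in LED around the section being measured so its duration
// can be estimated from the blink rate with a stopwatch.
#include <Arduino.h>

void RunInference() { /* placeholder for the code section you want to measure */ }

void setup() {
  pinMode(LED_BUILTIN, OUTPUT);
}

void loop() {
  digitalWrite(LED_BUILTIN, HIGH);  // LED on at the start of the section
  RunInference();
  digitalWrite(LED_BUILTIN, LOW);   // LED off at the end
  delay(500);                       // pause so individual blinks stay visible
}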
7.b Shotgun profiling
Once you have a general notion of how long a typical run of your application takes, the simplest way to determine how long a specific piece of code takes is to comment it out and observe how much faster the whole run becomes. This technique has been dubbed shotgun profiling, by analogy with shotgun debugging, in which you eliminate sizable portions of code to find a crash when little other information is available. Because there are often no data-dependent branches in the model execution code, turning any one operation into a no-op by commenting out its internal implementation shouldn't influence the speed of other portions of the model, which makes this approach surprisingly useful for profiling neural networks.
7.c Debug Logging
Because in most situations your embedded development board can output a line of text back to a host computer, logging would appear to be the perfect way to tell whether a piece of code is running. Unfortunately, talking to the development machine can take a lot of time in and of itself. On an Arm Cortex-M device, serial wire debug output can have a latency of up to 500 ms and is very variable, which rules it out as a straightforward way to profile with logging. UART-based debug logging is typically far less expensive, although it's still not ideal.
7.d Logic analyzer
You can have your code turn GPIO pins on and off, much as you would toggle an LED but with far more precision, and then use an external logic analyzer to visualize and measure the timing. This is a fairly flexible way to study the latency of your application without requiring any software support beyond control of one or more GPIO pins, though the equipment itself can be pricey and takes a little wiring.
7.e Timer
If you have a timer that can give you a consistent current time with sufficient accuracy, you can record the time at the beginning and end of the code section you're interested in and output the duration to the logs afterwards, where any communication latency won't affect the measurement.
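A minimal sketch of this approach is shown below, assuming an Arduino-style environment where micros() returns a microsecond timestamp and an interpreter that has been set up elsewhere as usual; on other platforms, substitute whatever monotonic timer your SDK provides.

// Time the Invoke() call with a timer, then log the result afterwards so the
// cost of printing doesn't pollute the measurement itself.
#include <Arduino.h>
#include "tensorflow/lite/micro/micro_interpreter.h"

extern tflite::MicroInterpreter* interpreter;  // created elsewhere, as usual

void TimedInvoke() {
  const unsigned long start_us = micros();
  interpreter->Invoke();
  const unsigned long duration_us = micros() - start_us;
  Serial.print("Invoke() took ");
  Serial.print(duration_us);
  Serial.println(" us");
}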
7.f Profiler
If you're fortunate, the toolchain and platform you're using will support some sort of external profiling tool. These tools typically gather data on execution while your code runs on the device and correlate it with the debug information from your build, so you can see which lines or functions are taking the most time. Because you can quickly explore and zoom in on the functions that matter, this is the quickest way to find the speed bottlenecks in your code.
8. Taking advantage of hardware features
You might find yourself on a platform such as a Cortex-M device with SIMD instructions, which are often a big help for the kinds of repetitive calculations that take up most of the time in neural network inference. Check the documentation of the vendor-supplied libraries to see whether something suitable has already been written to implement a larger part of an algorithm, because it will hopefully be highly optimized.
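As one example of what such a library call might look like, the sketch below compares a plain dot product over 8-bit values with the equivalent routine from CMSIS-DSP, which many Cortex-M vendors ship and which can use SIMD instructions where the hardware supports them. The exact header path and build setup depend on your toolchain, so treat this as illustrative rather than a drop-in snippet.

#include <cstdint>
#include "arm_math.h"  // CMSIS-DSP header; availability depends on your toolchain

// Plain C++ dot product over 8-bit values, for reference.
int32_t NaiveDotProduct(const int8_t* a, const int8_t* b, uint32_t length) {
  int32_t sum = 0;
  for (uint32_t i = 0; i < length; ++i) {
    sum += static_cast<int32_t>(a[i]) * static_cast<int32_t>(b[i]);
  }
  return sum;
}

// The same accumulation using the vendor-optimized CMSIS-DSP routine.
int32_t LibraryDotProduct(const q7_t* a, const q7_t* b, uint32_t length) {
  q31_t result = 0;
  arm_dot_prod_q7(a, b, length, &result);
  return result;
}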
9. Conclusion
I thank my GSoC mentor, Paul Ruiz, for guiding me throughout the project!