This blog extends my TensorFlow Lite for Microcontrollers tutorial. While reading through the TinyML book, I found its 17th chapter, which details how and why to reduce model size in a TinyML project, especially useful, so I turned it into a practical guide to help others get started with model size optimization faster.
This blog in a nutshell
1. Why optimize model and binary size?
2. Understanding your system's limits
3. Estimating memory usage
4. Reducing the size of your executable
5. Conclusion
1. Why optimize model and binary size?
Whatever platform you decide on, flash storage and RAM are probably going to be quite constrained. The majority of embedded systems have flash read-only storage capacities of less than 1 MB, and many merely have tens of kilobytes.
The same is true for memory: on low-end systems, the amount of static RAM may only be a few hundred bytes, seldom exceeding 512 KB. The good news is that TensorFlow Lite for Microcontrollers is designed to work with as little as 20 KB of flash and 4 KB of SRAM, but you will still need to plan your application carefully and make engineering trade-offs to keep the footprint small.
2. Understanding your system's limits
The majority of embedded systems use flash memory, which is only written to when new executables are uploaded, to store programs and other read-only data. Modifiable memory is typically also available, frequently utilizing SRAM technology.
SRAM offers quick access while consuming little power and is the same technology used for caches on larger CPUs, but it is limited in size. More sophisticated microcontrollers can provide a second tier of expandable memory using a more power-hungry but scalable technology such as dynamic RAM (DRAM). You should be aware of the benefits and drawbacks of the platforms you are considering.
3. Estimating memory usage
When you have an idea of what your hardware options are, you need to develop an understanding of what resources your software will need and what trade-offs you can make to control those requirements.
3.a Flash usage
You can usually determine exactly how much room you’ll need in flash by compiling a complete executable and then looking at the size of the resulting image. This can be confusing, because the first artifact that the linker produces is often an annotated version of the executable with debug symbols and section information, in a format like ELF. The file you want to look at is the actual one that’s flashed to the device, often produced by a tool like objcopy. The simplest equation for gauging the amount of flash memory you need is the sum of the following factors:
Operating system size
If you're using an RTOS of any kind, you'll need room in your executable for its code. Because the footprint is usually configurable according to which features you enable, the easiest way to estimate it is to build a sample "hello world" application with the features you require enabled. The size of the resulting image file serves as a baseline for the size of the OS code.
TensorFlow Lite for Microcontrollers code size
The ML framework needs space for the program logic to load and execute a neural network model, including the operator implementations that run the core arithmetic. To get started just compile one of the standard unit tests (like the micro_speech test) that includes the framework and look at the resulting image size for an estimate.
Model data size
If you don’t yet have a model trained, you can get a good estimate of the amount of flash storage space it will need by counting its weights.
For example, a fully connected layer will have a number of weights equal to the size of its input vector multiplied by the size of its output vector. For convolutional layers, it’s a bit more complex; you’ll need to multiply the width and height of the filter box by the number of input channels and multiply this by the number of filters.
Storage space for any bias vectors attached to each layer must also be included. This calculation can quickly become fiddly, so it may be simpler to actually build a candidate model in TensorFlow and export it as a TensorFlow Lite file. Because that file is mapped directly into flash, its size gives you a precise estimate of how much space the model will use. You can also look at the number of weights reported by Keras's model.summary() method.
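To make the arithmetic concrete, here is a minimal sketch that counts parameters for a small hypothetical model; all of the layer dimensions are invented for illustration, and the byte counts assume either 8-bit quantized or float32 weights.

#include <cstdio>

// Sketch: estimate the flash needed for the weights of a small hypothetical model.
// All layer dimensions below are invented for illustration.
int main() {
  // Convolutional layer: filter width x filter height x input channels x filter count.
  const int conv_weights = 3 * 3 * 1 * 8;  // 72 weights
  const int conv_biases = 8;               // one bias per filter

  // Fully connected layer: input vector size x output vector size.
  const int fc_weights = 256 * 10;         // 2,560 weights
  const int fc_biases = 10;                // one bias per output

  const int total_params = conv_weights + conv_biases + fc_weights + fc_biases;

  // One byte per parameter when quantized to int8, four bytes when stored as float32.
  std::printf("Parameters: %d (~%d bytes as int8, ~%d bytes as float32)\n",
              total_params, total_params, total_params * 4);
  return 0;
}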
Application code size
You’ll need code to access sensor data, preprocess it to prepare it for the neural network, and respond to the results. You might also need some other kinds of user interfaces outside of the machine learning module. This can be difficult to estimate, but you should at least try to understand whether you’ll need any external libraries and calculate what their code space requirements might be.
3.b RAM usage
Because the amount of RAM consumed varies over the course of your program's execution, figuring out how much modifiable memory you'll need can be trickier than estimating storage. Similar to the process of evaluating flash requirements, you need to work through the different layers of your software to determine the total usage:
Operating system size
Most RTOSs document how much RAM their different configuration options need, and you should be able to use this information to plan the required size.
TensorFlow Lite for Microcontrollers RAM size
The ML framework shouldn't need more than a few kilobytes of SRAM for its data structures, or much memory for its main runtime. These are allocated as part of the classes used by the interpreter, so whether your application code constructs them as global or local objects determines whether they live in general memory or on the stack.
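As a rough sketch of what this looks like in practice, the declarations below place the framework objects at file scope so they end up in static RAM rather than on a function's stack; the header paths and the arena size are illustrative, and exact APIs vary between framework versions.

#include <cstdint>

#include "tensorflow/lite/micro/micro_error_reporter.h"
#include "tensorflow/lite/micro/micro_interpreter.h"

// Declared at file scope, these objects live in static RAM (.bss/.data)
// instead of on the stack of whatever function sets up the interpreter.
namespace {
tflite::MicroErrorReporter micro_error_reporter;  // a few bytes of framework state
tflite::MicroInterpreter* interpreter = nullptr;  // will point at a statically created object

constexpr int kTensorArenaSize = 10 * 1024;  // illustrative size, tune for your model
uint8_t tensor_arena[kTensorArenaSize];      // working memory handed to the interpreter
}  // namespace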
Model memory size
When a neural network is executed, the results of one layer are fed into subsequent operations and so must be kept around for some time. The lifetimes of these activation layers vary depending on their position in the graph, and the memory size needed for each is controlled by the shape of the array that a layer writes out. These variations mean that it’s necessary to calculate a plan over time to fit all these temporary buffers into as small an area of memory as possible. This is done when the model is first loaded by the interpreter.
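If you want to check the planner's result yourself, more recent versions of the framework expose MicroInterpreter::arena_used_bytes(), which reports how much of the arena the memory plan actually consumed once AllocateTensors() has run. Below is a minimal sketch assuming that method is available in your version and that g_model_data is the (hypothetical) C array exported from your TensorFlow Lite file.

#include <cstddef>
#include <cstdint>

#include "tensorflow/lite/micro/all_ops_resolver.h"  // header path and namespace vary by version
#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];  // hypothetical: the C array exported from your .tflite file

// Sketch: load the model, let the interpreter plan its activation buffers,
// and report how much of the arena that plan actually used.
size_t MeasureArenaUsage() {
  static tflite::AllOpsResolver resolver;

  constexpr int kTensorArenaSize = 64 * 1024;  // deliberately generous for measurement
  static uint8_t tensor_arena[kTensorArenaSize];

  const tflite::Model* model = tflite::GetModel(g_model_data);
  // Older framework versions also require an ErrorReporter* as a final constructor argument.
  static tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                              kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) {
    return 0;  // arena too small or model contains unsupported ops
  }
  // arena_used_bytes() is available in more recent framework versions.
  return interpreter.arena_used_bytes();
}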
Application memory size
Like the program size, memory usage for your application logic can be difficult to estimate before it's written. You can, however, make some educated guesses about the major memory consumers, such as the buffers you'll need to store incoming sample data or the memory regions libraries will require for preprocessing.
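As a concrete, hypothetical example, an application that keeps one second of 16 kHz, 16-bit audio samples in RAM needs about 32,000 bytes for that buffer alone, before any preprocessing scratch space is counted:

#include <cstdint>
#include <cstdio>

// Sketch: back-of-the-envelope sizing for an application-level sample buffer.
// The sample rate and window length are illustrative, not taken from any specific project.
int main() {
  const int kSampleRateHz = 16000;              // 16 kHz microphone input
  const int kBufferLengthSeconds = 1;           // keep one second of history
  const int kBytesPerSample = sizeof(int16_t);  // 16-bit PCM samples

  const int buffer_bytes = kSampleRateHz * kBufferLengthSeconds * kBytesPerSample;
  std::printf("Audio buffer alone: %d bytes (~%d KB of SRAM)\n",
              buffer_bytes, buffer_bytes / 1024);
  return 0;
}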
4. Reducing the size of your executable
In a microcontroller application, your model is probably one of the largest consumers of read-only memory, but you also need to consider the size of your compiled code. Because of this code size restriction, we can't simply use an unmodified version of TensorFlow Lite when targeting embedded processors; it would consume hundreds of kilobytes of flash. TensorFlow Lite for Microcontrollers can be compressed to as little as 20 KB, but you may need to make some adjustments to leave out code that your application won't use.
4.a TFLite Micro Size
When you know your entire application’s code footprint size, you might want to investigate how much space is being taken up by TensorFlow Lite Micro. The simplest way to test this is by commenting out all your calls to the framework and seeing how much smaller the binary becomes.
You should expect a decrease of at least 20 to 30 KB. If you don't see something close to that, double-check that you've removed all the references, because the linker only strips code from the footprint when it is never called. To build a better picture of where the space is going, you can extend this approach to other modules in your code as well, provided no references to them remain.
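A low-effort way to do this without deleting code is to guard the framework calls behind a compile-time flag (the flag name below is made up); building once with the flag on and once with it off, then comparing the two image sizes, gives you the framework's footprint.

// Sketch: wrap every call into TensorFlow Lite Micro behind a hypothetical
// compile-time flag, so the linker can discard the framework when it is off.
#define USE_TFLITE_MICRO 1  // build once with 1 and once with 0, then compare image sizes

void RunInference() {
#if USE_TFLITE_MICRO
  // ... interpreter setup and interpreter->Invoke() calls go here ...
#else
  // With no remaining references, the linker strips the framework's code
  // from the final image, revealing how much space it was taking.
#endif
}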
4.b OpResolver
TensorFlow Lite supports over a hundred operations, but it’s unlikely that you’ll need all of them within a single model. The individual implementations of each operation might take up only a few kilobytes, but the total quickly adds up with so many available. Luckily, there is a built-in mechanism to remove the code footprint of operations you don’t need.
TensorFlow Lite uses the OpResolver interface to look up the implementation of each included operation when loading a model. The class you pass into the interpreter contains the mechanism for mapping an op definition to the function pointers of its implementation. This interface exists so that you can control which implementations are actually linked in.
Most of the sample code uses an instance of the AllOpsResolver class. It implements the OpResolver interface and, as the name suggests, contains entries for every operation supported by TensorFlow Lite for Microcontrollers. This makes it simple to get started, because you can load any supported model without worrying about which operations it contains.
If you use only a few ops, you can create an instance of the MicroMutableOpResolver class and directly add the op registrations you need. MicroMutableOpResolver implements the OpResolver interface but has additional methods that let you add ops to the list.
static tflite::MicroMutableOpResolver micro_mutable_op_resolver;
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_DEPTHWISE_CONV_2D,
    tflite::ops::micro::Register_DEPTHWISE_CONV_2D());
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_FULLY_CONNECTED,
    tflite::ops::micro::Register_FULLY_CONNECTED());
micro_mutable_op_resolver.AddBuiltin(
    tflite::BuiltinOperator_SOFTMAX,
    tflite::ops::micro::Register_SOFTMAX());
You may have noticed that the resolver object is declared static. This is because the interpreter may call into it at any time, so it needs a lifetime at least as long as that of the interpreter object itself.
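Note that more recent releases of TensorFlow Lite for Microcontrollers replace the AddBuiltin() calls shown above with a resolver that is templated on the number of ops and exposes per-op convenience methods; the equivalent sketch under that newer API (check the version you are actually building against) looks like this:

// Newer-style API (version-dependent): the template argument reserves space
// for exactly the number of ops you register, so no entries are wasted.
static tflite::MicroMutableOpResolver<3> micro_op_resolver;
micro_op_resolver.AddDepthwiseConv2D();
micro_op_resolver.AddFullyConnected();
micro_op_resolver.AddSoftmax();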
4.c Framework constants
In a few places in the library code, array sizes are hard-coded in order to avoid dynamic memory allocation. If RAM gets very tight, it's worth investigating whether you can shrink them for your program.
One of these is TFLITE_REGISTRATIONS_MAX, which controls the maximum number of distinct operations that can be registered. The default value of 128 is likely far more than most applications need, and it creates an array of 128 TfLiteRegistration structs; at a minimum of 32 bytes each, that costs about 4 KB of RAM.
You can also look at other things like kStackDataAllocatorSize in MicroInterpreter, or try shrinking the size of the arena you pass into the constructor of your interpreter.
5. Conclusion
I thank my GSoC mentor, Paul Ruiz, for guiding me throughout the project!