# TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems

###### tags: `TinyML`

## Source:

###### paper: https://proceedings.mlsys.org/paper/2021/file/d2ddea18f00665ce8623e36bd4e3c7c5-Paper.pdf
###### github: https://github.com/tensorflow/tensorflow/tree/master/tensorflow/lite/micro

## Overview:

* TensorFlow Lite for Microcontrollers (TFLM) allows us to run inference directly on microcontrollers.
* TFLM requires a 32-bit processor, such as an ARM Cortex-M or ESP32. Also note that the library is written in C++ (C++11), so you will need a C++ toolchain.
* Because microcontrollers are limited in resources, we generally train the model on a desktop computer and run only the inference on the microcontroller.

## Implementation:

![](https://i.imgur.com/HiJCMAB.png)

### Steps:

#### 1. The first step in developing a TFLM application is to create a live neural-network-model object in memory. The application also produces an "operator resolver" object through the client API; the resolver controls which operator implementations are linked into the final binary, minimizing its size.

#### 2. The second step is to supply a contiguous memory "arena" that holds intermediate results and other variables the interpreter needs, since we assume dynamic memory allocation is unavailable. (The size required depends on the model and may need to be determined by experimentation.)

#### 3. The third step is to create an interpreter instance, supplying it with the model, operator resolver, and arena as arguments. (The interpreter allocates all the memory it needs from the arena.)

#### 4. Then comes execution. The application retrieves pointers to the memory regions that represent the model inputs and populates them with values. Once the inputs are ready, it invokes the interpreter to perform the model calculations.

#### 5. Finally, once invocation finishes, the application queries the interpreter for the location of the arrays containing the model outputs and then uses those outputs.
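The five steps above map directly onto the TFLM C++ API. The sketch below is a minimal, hedged example of that workflow: `g_model_data`, the registered operators, the arena size, and the `RunOnce` wrapper are illustrative assumptions rather than anything taken from the paper, and exact header paths and constructor arguments vary between TFLM versions (older releases also require an `ErrorReporter`).

```cpp
// Minimal sketch of the five-step TFLM workflow, under the assumptions above.
#include <cstdint>

#include "tensorflow/lite/micro/micro_interpreter.h"
#include "tensorflow/lite/micro/micro_mutable_op_resolver.h"
#include "tensorflow/lite/schema/schema_generated.h"

extern const unsigned char g_model_data[];  // placeholder: serialized FlatBuffer model

// Step 2: a contiguous memory arena; the right size is found by experimentation.
constexpr int kTensorArenaSize = 10 * 1024;
alignas(16) static uint8_t tensor_arena[kTensorArenaSize];

int RunOnce(float input_value, float* output_value) {
  // Step 1a: map the serialized model into a live model object (no copying).
  const tflite::Model* model = tflite::GetModel(g_model_data);

  // Step 1b: an operator resolver that registers only the ops this model uses,
  // so only those kernels are linked into the final binary.
  tflite::MicroMutableOpResolver<2> resolver;
  resolver.AddFullyConnected();
  resolver.AddSoftmax();

  // Step 3: the interpreter takes the model, resolver, and arena, and
  // allocates everything it needs from the arena up front.
  tflite::MicroInterpreter interpreter(model, resolver, tensor_arena,
                                       kTensorArenaSize);
  if (interpreter.AllocateTensors() != kTfLiteOk) return -1;

  // Step 4: populate the input tensor, then run the model.
  TfLiteTensor* input = interpreter.input(0);
  input->data.f[0] = input_value;
  if (interpreter.Invoke() != kTfLiteOk) return -1;

  // Step 5: read the result from the output tensor.
  TfLiteTensor* output = interpreter.output(0);
  *output_value = output->data.f[0];
  return 0;
}
```

In a real application, steps 1–3 typically run once at startup and steps 4–5 run repeatedly in the main loop, as in the TFLM example applications.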
### TFLM Interpreter:

* The basis for deploying production models on embedded hardware.
* Makes it easier to share code across multiple models and applications.
* Allows the model to be updated without recompiling the application.

### Model loading:

#### Model representation:

* Advantage: easy to develop for embedded platforms, since the storage schema is copied from the TensorFlow Lite representation.
* Disadvantage: the format was designed to be portable from system to system, so it requires run-time processing to yield the information that inferencing requires.

### Memory management:

* To prevent memory errors from interrupting a long-running program, we ensure that allocations occur only during the interpreter's initialization phase. No allocation is possible during model invocation.

#### The two-stack allocation:

![](https://i.imgur.com/IRnXJDw.png)

* Initialization- and evaluation-lifetime allocations reside in a separate stack from interpreter-lifetime objects.
* As the figure above shows, a stack that grows upward from the arena's lowest address holds the function-lifetime objects ("Head" in the image), and a stack that grows downward from the arena's highest address holds the interpreter-lifetime allocations ("Tail" in the image).
* When the two stack pointers cross, the arena has run out of capacity. (A minimal sketch of this head/tail scheme appears at the end of this note.)

#### Optimization --> memory planner:

![](https://i.imgur.com/UyQ3wgk.png)

* The "Memory Planner" encapsulates this process (from naive allocation to bin packing).
* Memory planning at run time incurs more overhead during model preparation than a preplanned memory-allocation strategy, but it buys model generality.
* TFLM models simply list their operator and tensor requirements, so at run time we can allocate for, and thereby support, many model types.

#### Multitenancy:

* TFLM supports multitenancy through memory-planner changes that are transparent to the developer.
* The memory arena is reused by letting multiple model interpreters allocate from a single arena.
* Interpreter-lifetime areas stack on top of each other in the arena, while the function-lifetime section is reused for each model's evaluation.

##### Reusable vs. nonreusable parts:

* The reusable (nonpersistent) part is sized to the largest requirement across all models allocating in the arena.
* The nonreusable (persistent) allocations grow with each model, since these allocations are model specific.

#### Multithreading:

* TFLM is thread-safe as long as no state associated with the model is kept outside the interpreter and the model's memory allocation within the arena.

![](https://i.imgur.com/t9AbtSs.png)

### System evaluation:

* The experiments run two models: Visual Wake Words and Google Hotword.

#### The microcontrollers:

![](https://i.imgur.com/xPGu296.png)

#### Benchmark performance:

* In the image below, we can see that the interpreter overhead for Google Hotword is proportionally higher, because less of its total time goes to the kernel calculations.

![](https://i.imgur.com/bVQrh3S.png)

* To further analyze memory usage, recall that TFLM splits its memory usage into two main sections: persistent and nonpersistent.

![](https://i.imgur.com/oEvSA4e.png)

#### Benchmarking and profiling:

* TFLM provides a set of benchmarks and profiling APIs (TensorFlow, 2020c) to compare hardware platforms and to let developers measure performance and identify opportunities for optimization.
* Benchmarks provide a consistent and fair way to measure hardware performance.

### Conclusion:

* TFLM enables the transfer of deep learning onto embedded systems.
* It is a framework specifically engineered to run machine learning effectively and efficiently on embedded devices with only a few kilobytes of memory.
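### Appendix: a sketch of the two-stack arena

To make the head/tail layout from the memory-management section concrete, here is a minimal, self-contained sketch of a two-stack bump allocator. It only illustrates the idea; it is not TFLM's actual allocator code, the class and method names are invented for this note, and alignment handling is omitted.

```cpp
// Toy two-stack ("head"/"tail") arena, illustrating the allocation scheme
// described in the memory-management section. Not TFLM's real allocator.
#include <cstddef>
#include <cstdint>

class TwoStackArena {
 public:
  TwoStackArena(uint8_t* buffer, size_t size)
      : head_(buffer), tail_(buffer + size) {}

  // Function-lifetime (nonpersistent) data grows upward from the lowest
  // address; this region can be reset and reused between model invocations.
  void* AllocateFromHead(size_t bytes) {
    if (static_cast<size_t>(tail_ - head_) < bytes) return nullptr;  // stacks would cross
    void* result = head_;
    head_ += bytes;
    return result;
  }

  // Interpreter-lifetime (persistent) data grows downward from the highest
  // address and stays allocated for as long as the interpreter exists.
  void* AllocateFromTail(size_t bytes) {
    if (static_cast<size_t>(tail_ - head_) < bytes) return nullptr;  // out of capacity
    tail_ -= bytes;
    return tail_;
  }

 private:
  uint8_t* head_;  // next free byte at the low end of the arena
  uint8_t* tail_;  // first byte of the used region at the high end
};

int main() {
  static uint8_t arena[4096];
  TwoStackArena allocator(arena, sizeof(arena));

  // Persistent data (e.g., tensor metadata) comes from the tail;
  // reusable scratch (e.g., intermediate activations) comes from the head.
  void* persistent = allocator.AllocateFromTail(256);
  void* scratch = allocator.AllocateFromHead(512);
  return (persistent != nullptr && scratch != nullptr) ? 0 : 1;
}
```

The tail side corresponds to the persistent, interpreter-lifetime allocations that accumulate per model, while the head side corresponds to the reusable nonpersistent region that multitenant models share.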