# CMSIS-NN & Optimizations for Edge AI

###### tags: `TinyML`
###### slides: [link](https://cms.tinyml.org/wp-content/uploads/talks2021/tinyML_Talks_Felix_Johnny_Thomasmathibalan_and_Fredrik_Knutsson_210208.pdf)

## Introduction

- What is CMSIS-NN
  ![](https://i.imgur.com/o9oyIZB.png)
  - Input: a TensorFlow Lite for Microcontrollers model
  - Destination: an ARM(R) Cortex(R) processor (system-on-chip)
- Optimization
  1. Model Optimizations
  2. SW Environment Settings
  3. <mark>Library Optimizations</mark>
  4. HW Design

## CMSIS-NN: Library Optimizations

### Preparation

- Cycle measurements

###### Non-optimized kernels
![](https://i.imgur.com/ty7sqC2.png)

- Measure the cycles layer by layer.
- Setting a target for optimization
  - Tells you how realistic your expectations are.

###### The library optimizer
![](https://i.imgur.com/7qrNzss.png)

- Cycle bound analysis
  - Processor capability & utilization
    ![](https://i.imgur.com/fE0emIX.png)
    - C~t~ = Speed of Light (SoL), or capability: the best-case cycle count for executing a given arithmetic operator.
    - C~a~ = the actual cycles taken for execution.
    ![](https://i.imgur.com/gfkGIxf.png)
    - The values in the table are the number of MACs per cycle, e.g. the Cortex-M4 can perform 1 MAC per cycle.
  - Example (TFLM reference kernels)
    ![](https://i.imgur.com/nee1z3V.png =200x) ![](https://i.imgur.com/TqCqG1T.png =300x)
    - In the left figure, the DW convolutions account for only 13.5% of the MACs but consume nearly 40% of the cycles -> an unfair representation, since conv and DW conv do more than just MACs.
    - In the right figure, convolution is about four times faster than DW convolution.
- Memory bound analysis
  - Theoretical bandwidth
    - The maximum data transfer rate for a given hardware specification.
  - Effective bandwidth
    - The actual data transfer rate for an operation, e.g. the reads/writes of weights, inputs and outputs in a convolution.
  - Neural network applications usually involve several memory blocks (on- and off-chip), and the work is usually about reducing the effective bandwidth an operation needs. Optimization is about striking a balance between the two: fetch enough data that the processor is not stalled waiting for it, without fetching more than the operation needs.

### Model Optimizations

- Use case: fully connected
  - im2col: inputs and weights are contiguous in memory. The target is convolution or DW convolution.
  1. Loop unrolling can reduce input reads -> depends on the number of general-purpose (GP) & vector registers. Also check that the compiler output matches the expectation, because sometimes it does not. (A sketch follows after this list.)
  2. SIMD
     ![](https://i.imgur.com/MQfoLa8.png)
     A single instruction loads 16 bytes of data and can perform 8 MACs.
  3. Simplification, e.g. move instructions out of the core loop to free up registers that can then be used for deeper unrolling.
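To make point 1 above concrete, here is a minimal plain-C sketch (not CMSIS-NN code; the function name `fc_unrolled_2x` and the data layout are illustrative assumptions) of a fully connected layer unrolled over two outputs, so that each input value is loaded once and reused for two MACs instead of being re-read for every output:

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative sketch only: 2x output unrolling for a fully connected layer.
 * Each input value is read once and feeds two accumulators, roughly halving
 * the input reads of a naive one-output-at-a-time loop. The real CMSIS-NN
 * kernels additionally use SIMD intrinsics and packed int8 loads. */
static void fc_unrolled_2x(const int8_t *input,
                           const int8_t *weights, /* [num_outputs][num_inputs] */
                           int32_t *output,
                           size_t num_inputs,
                           size_t num_outputs)
{
    size_t o = 0;
    for (; o + 1 < num_outputs; o += 2) {
        const int8_t *w0 = &weights[o * num_inputs];
        const int8_t *w1 = &weights[(o + 1) * num_inputs];
        int32_t acc0 = 0;
        int32_t acc1 = 0;
        for (size_t i = 0; i < num_inputs; i++) {
            const int32_t in = input[i]; /* one input read, two MACs */
            acc0 += in * w0[i];
            acc1 += in * w1[i];
        }
        output[o] = acc0;
        output[o + 1] = acc1;
    }
    for (; o < num_outputs; o++) { /* tail when num_outputs is odd */
        const int8_t *w = &weights[o * num_inputs];
        int32_t acc = 0;
        for (size_t i = 0; i < num_inputs; i++) {
            acc += (int32_t)input[i] * w[i];
        }
        output[o] = acc;
    }
}
```

Deeper unrolling (four or more outputs per pass) cuts input reads further, but only pays off while the accumulators and pointers still fit in the available GP/vector registers, which is why the register budget is called out above and in the next section.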
### Essentials of Optimization

- Memory access optimization
  - Reduce the relevant memory accesses while staying within the available GP/vector register constraint.
- Keep it simple
  - Constant pointer increment/decrement in core loops.
  - Simplify the core loop further by moving input/filter offset adjustments out of it.
- Processor capability
  - Single Instruction Multiple Data (SIMD) optimizations.

### Aligned access

![](https://i.imgur.com/w2hD3q8.png =200x)![](https://i.imgur.com/3lJcKxe.png =200x)
![](https://i.imgur.com/KmH7cCg.png)

- Unaligned access is detrimental.
- This applies to other operators such as max pool, average pool and the FC layer as well.
- Changing the input size from 3x3x3 to 3x3x4 increases the MAC count by 33.3% but reduces the cycle count by 21.1%.
- Memory-alignment-aware shapes get the best out of CMSIS-NN.

## Deploy CMSIS-NN using TensorFlow Lite for Microcontrollers

![](https://i.imgur.com/62h66t4.png =200x)

- Optimized kernels are enabled with `OPTIMIZED_KERNEL_DIR=cmsis_nn` in the TFLM build system (see the example build command at the end of these notes).
- Optimized kernels are used when available; otherwise the build falls back to the TFLM reference kernels.
- The reference kernels are written in pure C.

### What kernels are optimized?

![](https://i.imgur.com/sXPL0AD.png =300x)

- The figure shows the percentage of cycles spent per operator.
- Quantization
- Operators
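For reference, a TFLM build that enables the CMSIS-NN kernels typically looks roughly like the following. This is a sketch only: the Makefile path, `TARGET`/`TARGET_ARCH` values and available targets vary between TFLM versions, so check the TFLM documentation for your tree.

```bash
# Build the TFLM static library with CMSIS-NN optimized kernels for a
# Cortex-M4 target (paths and target names may differ in your TFLM version).
make -f tensorflow/lite/micro/tools/make/Makefile \
     TARGET=cortex_m_generic \
     TARGET_ARCH=cortex-m4 \
     OPTIMIZED_KERNEL_DIR=cmsis_nn \
     microlite
```

Omitting `OPTIMIZED_KERNEL_DIR=cmsis_nn` builds the pure-C reference kernels mentioned above, which is a convenient way to compare cycle counts with and without the optimized library.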