# MCU Basic & Advanced Study Note

> Why fight over it? Just mix AI and MCU together and do tinyML!
> https://ithelp.ithome.com.tw/users/20141396/ironman/4855
> AI Embedded System Algorithm Optimization and Implementation
> https://www.tenlong.com.tw/products/9787111693253

:::info
:information_source: **Create a Project Supporting CMSIS-NN with STM32CubeIDE**

Porting the CMSIS-NN neural network library with STM32CubeIDE
https://www.bilibili.com/video/BV16J411w731/

I have created a project that already includes CMSIS-DSP and CMSIS-NN. Please refer to the GitHub repository.

*TODO: Upload to GitHub*
:::

## Platform Used

***Used in CMSIS-related projects***

* Seeed Studio XIAO nRF52840
  https://www.seeedstudio.com/Seeed-XIAO-BLE-nRF52840-p-5201.html
* STM32F767ZI
  https://www.st.com/en/evaluation-tools/nucleo-f767zi.html

:::info
:bulb: **CMSIS Reference**

*Common Microcontroller Software Interface Standard*
https://www.keil.com/pack/doc/CMSIS/General/html/index.html

*[Day 14] tinyML Development Frameworks (2): An Introduction to Arm CMSIS*
https://ithelp.ithome.com.tw/articles/10273236
:::

***Used in TinyML-related projects***

* Arduino Nano 33 BLE Sense
  https://docs.arduino.cc/hardware/nano-33-ble-sense/

## MCU Communication Interfaces

> Reference:
> * 【Maker Advanced】Getting to Know the UART, I2C, and SPI Interfaces
> https://makerpro.cc/2016/07/learning-interfaces-about-uart-i2c-spi/

### SPI (Serial Peripheral Interface)

> Reference:
> * [STM32] 18-SPI
> https://medium.com/%E9%96%B1%E7%9B%8A%E5%A6%82%E7%BE%8E/stm32-18-spi-679573f11c31

### UART

> Reference:
> * [STM32] 19-UART
> https://medium.com/%E9%96%B1%E7%9B%8A%E5%A6%82%E7%BE%8E/stm32-19-uart-8f104abc0798

A good tutorial on how to debug UART and what can go wrong:

**[A Brief Look at the UART Protocol] Testing and analyzing several common causes of UART errors. [UART serial port][CH340][MaxBaudRate][UART_ErrorRate][UART_Recovery]** by [阿吉米德](https://www.youtube.com/@modernagmid)
https://www.youtube.com/watch?v=ZJaQryS0lHk

### I2C

> Reference:
> * [STM32] 20-I2C
> https://medium.com/%E9%96%B1%E7%9B%8A%E5%A6%82%E7%BE%8E/stm32-20-i2c-c4c9ce6d1c3a
> * I2C For Hackers: The Basics
> https://hackaday.com/2024/08/07/i2c-for-hackers-the-basics/
> * I2C For Hackers: Digging Deeper
> https://hackaday.com/2024/09/05/i2c-for-hackers-digging-deeper/

## Cortex Microcontroller Software Interface Standard (CMSIS) Framework

> Reference:
> https://www.keil.com/pack/doc/CMSIS/General/html/index.html

## CMSIS-DSP

> Reference:
> https://www.keil.com/pack/doc/CMSIS/DSP/html/index.html

* Basic math functions
* Fast math functions
* Complex math functions
* Filtering functions
* Matrix functions
* Transform functions
* Motor control functions
* Statistical functions
* Support functions
* Interpolation functions
* Support Vector Machine functions (SVM)
* Bayes classifier functions
* Distance functions
* Quaternion functions

### Matrix Calculation

The following code demonstrates the same matrix multiplication on a float32 matrix and on a q31 (fixed-point) matrix. Notice that there is no visible difference in their outputs, but the quantized version needs extra conversions between float and q31.

:::info
:bulb: ***Quantized Types***

Refer to this other post:
> AI Embedded Systems Algorithm Optimization and Implementation
> https://hackmd.io/@Erebustsai/Sk39malXa
:::

***Example Full Code***

```cpp
/*
 * @author ErebusTsai
 * @note In most of my work, I use colSize and rowSize instead of nrow and ncol, respectively.
 */
#include "arm_math.h"
#include "Adafruit_TinyUSB.h"

const float32_t A_f32[4]{ 0.1, 0.2, 0.3, 0.4 };
q31_t A_q31[4];

const float32_t B_f32[16]{
  0.1, 0.2, 0.3, 0.4,
  0.5, 0.6, 0.7, 0.8,
  0.9, 0.1, 0.11, 0.12,
  0.13, 0.14, 0.15, 0.16
};
q31_t B_q31[16];

float32_t Y_f32[4];
q31_t Y_q31[4];
float32_t Y_f32q[4];

arm_status status;  // CMSIS-DSP calculation result status

arm_matrix_instance_f32 A;
arm_matrix_instance_f32 B;
arm_matrix_instance_f32 Y;
arm_matrix_instance_q31 Aq;
arm_matrix_instance_q31 Bq;
arm_matrix_instance_q31 Yq;

uint32_t srcRows, srcColumns;

void setup() {
  Serial.begin(9600);
  while (!Serial)
    ;
  Serial.print("XIAO-nRF52840 Start\n");

  // Initialize the f32 matrices
  Serial.print("Start Initial f32 Matrix\n");
  srcRows = 1;
  srcColumns = 4;
  arm_mat_init_f32(&A, srcRows, srcColumns, (float32_t*)A_f32);
  srcRows = 4;
  srcColumns = 4;
  arm_mat_init_f32(&B, srcRows, srcColumns, (float32_t*)B_f32);
  srcRows = 1;
  srcColumns = 4;
  arm_mat_init_f32(&Y, srcRows, srcColumns, (float32_t*)Y_f32);

  // Initialize the q31 matrices by converting the float data
  Serial.print("Start Initial q31 Matrix\n");
  arm_float_to_q31(A_f32, A_q31, 4);
  arm_float_to_q31(B_f32, B_q31, 16);
  srcRows = 1;
  srcColumns = 4;
  for (int i = 0; i < srcRows; ++i) {
    for (int j = 0; j < srcColumns; ++j) {
      Serial.print(A_q31[i * srcColumns + j]);  // row-major indexing
      Serial.print(" ");
    }
    Serial.print("\n");
  }
  srcRows = 4;
  srcColumns = 4;
  for (int i = 0; i < srcRows; ++i) {
    for (int j = 0; j < srcColumns; ++j) {
      Serial.print(B_q31[i * srcColumns + j]);
      Serial.print(" ");
    }
    Serial.print("\n");
  }
  srcRows = 1;
  srcColumns = 4;
  arm_mat_init_q31(&Aq, srcRows, srcColumns, (q31_t*)A_q31);
  srcRows = 4;
  srcColumns = 4;
  arm_mat_init_q31(&Bq, srcRows, srcColumns, (q31_t*)B_q31);
  srcRows = 1;
  srcColumns = 4;
  arm_mat_init_q31(&Yq, srcRows, srcColumns, (q31_t*)Y_q31);
}

void loop() {
  Serial.print("Start f32 matrix multiplication\n");
  status = arm_mat_mult_f32(&A, &B, &Y);
  if (status != ARM_MATH_SUCCESS) {
    Serial.print("FAILURE\n");
    while (true)
      ;
  }

  Serial.print("Start q31 matrix multiplication\n");
  status = arm_mat_mult_q31(&Aq, &Bq, &Yq);
  if (status != ARM_MATH_SUCCESS) {
    Serial.print("FAILURE\n");
    while (true)
      ;
  }

  Serial.print("Convert q31 output to f32\n");
  arm_q31_to_float(Yq.pData, Y_f32q, 4);

  // srcRows/srcColumns were last set to 1 x 4, which matches Y and Y_f32q.
  for (int i = 0; i < srcRows; ++i) {
    for (int j = 0; j < srcColumns; ++j) {
      Serial.print(Y.pData[i * srcColumns + j]);
      Serial.print(" ");
    }
    Serial.print("\n");
  }
  for (int i = 0; i < srcRows; ++i) {
    for (int j = 0; j < srcColumns; ++j) {
      Serial.print(Y_f32q[i * srcColumns + j]);
      Serial.print(" ");
    }
    Serial.print("\n");
  }
}
```

***Output***

```bash
04:48:30.686 -> XIAO-nRF52840 Start
04:48:30.686 -> Start Initial f32 Matrix
04:48:30.686 -> Start Initial q31 Matrix
04:48:30.686 -> 214748368 429496736 644245120 858993472
04:48:30.686 -> 214748368 429496736 644245120 858993472
04:48:30.686 -> 1073741824 1288490240 1503238528 1717986944
04:48:30.686 -> 1932735232 214748368 236223200 257698032
04:48:30.686 -> 279172864 300647712 322122560 343597376
04:48:30.686 -> Start f32 matrix multiplication
04:48:30.686 -> Start q31 matrix multiplication
04:48:30.686 -> Convert q31 output to f32
04:48:30.686 -> 0.43 0.23 0.26 0.30
04:48:30.686 -> 0.43 0.23 0.26 0.30
```

## Intro to TinyML with TFLM (TensorFlow Lite for Microcontrollers)

:::info
:bulb: **Hardware Used**

* *GPU info from TensorFlow*
```
2024-04-09 12:14:13.664609: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1720] Found device 0 with properties:
pciBusID: 0000:81:00.0 name: Tesla P40 computeCapability: 6.1
coreClock: 1.531GHz coreCount: 30 deviceMemorySize: 24.00GiB deviceMemoryBandwidth: 323.21GiB/s
```
* *TensorFlow Version*
```
>>> tf.version.VERSION
'2.4.0'
```
:::

### Applications of TinyML

* Predictive maintenance: monitoring vibration, torque, and other signals to check whether a machine is functioning correctly.
* Healthcare
* Agriculture
* Voice-assisted devices
* Ocean-life conservation

:::info
:bulb: **FlatBuffers**
https://zhuanlan.zhihu.com/p/391109273
:::

## Hello World Project

TensorFlow Lite workflow:

![image](https://hackmd.io/_uploads/S1I-hjjGA.png)
Reference:
https://www.researchgate.net/figure/Tensorflow-Lite-Workflow-adapted-from-Cavagnis-2023_fig1_376715530

### Convert a TensorFlow Model to a TFLite Model

In the following code snippet, a TensorFlow model in `h5` format is loaded, converted, and saved as a TFLite model.

```python
import pathlib

import tensorflow as tf
from tensorflow.keras.models import load_model

baseline_model = load_model('baseline_model.h5')

converter = tf.lite.TFLiteConverter.from_keras_model(baseline_model)
tflite_model = converter.convert()

tflite_models_dir = pathlib.Path("./")
tflite_models_file = tflite_models_dir/'model.tflite'
tflite_models_file.write_bytes(tflite_model)
```

Another way to convert a model to a TFLite model is to first create a graph representation of it, known as a *concrete function*.

```python
# Export the model as a concrete function
func = tf.function(baseline_model).get_concrete_function(
    tf.TensorSpec(baseline_model.inputs[0].shape, baseline_model.inputs[0].dtype)
)

# Serialized graph representation of the concrete function
func.graph.as_graph_def()

# Convert the concrete function to TFLite
converter = tf.lite.TFLiteConverter.from_concrete_functions([func])
tflite_model = converter.convert()

tflite_models_dir = pathlib.Path("./")
tflite_models_file = tflite_models_dir/'model_concrete.tflite'
tflite_models_file.write_bytes(tflite_model)
```

:::info
:bulb: **TensorFlow 2.x: An Introduction to and Usage of tf.function and AutoGraph**
https://blog.csdn.net/qq_40913465/article/details/104604979

> Reference: Book p. 117
> We can further convert the baseline model to a TensorFlow graph using tf.function, which contains all the computational operations, variables, and weights.
:::

### Use the TFLite Interpreter

:::info
:bulb: A good article series about TFLite:
https://hackmd.io/@yillkid/ByQ7ySDT8/https%3A%2F%2Fhackmd.io%2F%40yillkid%2FrkUlAjkGF
:::

```python
import numpy as np
import tensorflow as tf

tflite_model_file = 'model_concrete.tflite'

interpreter = tf.lite.Interpreter(model_path=tflite_model_file)
interpreter.allocate_tensors()

input_index = interpreter.get_input_details()[0]['index']
output_index = interpreter.get_output_details()[0]['index']

pred_list = []
for images in X_test:
    # Add batch and channel dimensions: (H, W) -> (1, H, W, 1)
    input_data = np.array(images, dtype=np.float32)
    input_data = input_data.reshape(1, input_data.shape[0], input_data.shape[1], 1)
    interpreter.set_tensor(input_index, input_data)
    interpreter.invoke()
    prediction = interpreter.get_tensor(output_index)
    prediction = np.argmax(prediction)
    pred_list.append(prediction)
```

### Quantization

## ESP-IDF vs Arduino Framework Performance

The following video compares the performance of binaries built with ESP-IDF against those built with the Arduino framework:
https://www.youtube.com/watch?v=O-7rPkya4Yw
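
The Quantization section above is still a stub. As a starting point, here is a minimal sketch of TFLite post-training quantization (dynamic-range and full-integer). It uses a small stand-in Keras model and a random representative dataset, both of which are illustrative assumptions; in practice the note's `baseline_model` and real calibration samples would be used instead.

```python
import numpy as np
import tensorflow as tf

# Stand-in model; the note's baseline_model (from baseline_model.h5)
# would be used here instead.
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation='relu'),
    tf.keras.layers.Dense(2),
])

# 1. Dynamic-range quantization: weights are stored as int8,
#    activations stay in float.
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
dynamic_quant_model = converter.convert()

# 2. Full-integer quantization: activations are quantized too, which
#    requires a representative dataset to calibrate their ranges. This
#    is the variant needed for integer-only kernels on MCUs.
def representative_dataset():
    # Illustrative assumption: random samples in place of real data.
    for _ in range(100):
        yield [np.random.rand(1, 8).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8
int8_quant_model = converter.convert()

# Both conversions return the serialized FlatBuffer as bytes, ready for
# pathlib.Path(...).write_bytes(...) as in the earlier snippets.
```

When running the int8 model with `tf.lite.Interpreter`, the input/output scale and zero-point are available from `get_input_details()[0]['quantization']` and must be applied when converting between float and int8 tensors.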