The Idea of Work

---
title: The Idea of Work
tags: Research
---

# The Idea of Work

- Make deployment easier
- Dynamic operation library
- Memory management
- The duplication of model file
- Swap model
    - Effect on latency
    - Swap policy
- The copy of interpreter and operation library
    - More than one inference program on a device
- Make adding custom operation easier

# My work

## Hardware and software development environment

- Development board: [STM32H745I-DISCO](https://www.st.com/en/evaluation-tools/stm32h745i-disco.html)
    - MCU: [STM32H745XI](https://www.st.com/en/microcontrollers-microprocessors/stm32h745xi.html)
        - Flash: 2 Mbytes
        - RAM: 1 Mbyte
        - CPU: Cortex-M4, Cortex-M7
            - Little endian
            - 240 MHz (Cortex-M4)
            - 480 MHz (Cortex-M7)
            - [MPU and cache](http://news.eeworld.com.cn/mcu/ic486426.html)
    - 2 x 512-Mbit Quad-SPI NOR Flash memory
    - 128-Mbit SDRAM
    - 4-Gbyte on-board eMMC
- Development environment
    - Kernel version: Linux 4.15.0-74-generic
    - Distribution: Ubuntu 16.04
    - IDE: STM32CubeIDE
    - Compiler: gcc, g++
        - -std=gnu11
        - -std=gnu++14
- Memory map
    ![](https://i.imgur.com/6PVFVs0.png)
    ![](https://i.imgur.com/4AgP1YG.png)
- System architecture
    ![](https://i.imgur.com/dxv9S59.png)
- Memory usage
    - DTCM: main memory
    - AXI SRAM: FatFS
    - SRAM1:
    - SRAM2:
    - SRAM3:
    - SRAM4: LwIP

## Architecture

![](https://i.imgur.com/VXCs4S5.png)

## Workflow of startup

|M4|M7|
|--|--|
|- Reset_Handler<br>- Set the initial SP<br>- Copy the data segment initializers from flash to SRAM<br>- Zero fill the bss segment|- The same as M4 core|
|- Call SystemInit() (initialize the FPU setting and vector table location configuration)|- The same as M4 core (moreover, reset the RCC clock configuration)|
|- Call \_\_libc_init_array()<br>- Branch to main|- The same as M4 core|
|- Wait for a hardware semaphore to synchronize with M7 core||
||- Configure the system clock<br>- Prepare message buffer<br>- Signal the semaphore|
|- Synchronization point|- Synchronization point|
|- Call HAL_Init()<br>&emsp;- Configures the SysTick to generate an interrupt every 1 millisecond, clocked by the HSI (at this stage, the clock is not yet configured and thus the system is running from the internal HSI at 64 MHz).<br>&emsp;- Sets NVIC Group Priority to 4.<br>&emsp;- Calls HAL_MspInit(), a callback function defined in the user file "stm32h7xx_hal_msp.c", to do the global low-level hardware initialization|- The same as M4 core|
|- Initialize required peripherals<br>- Initialize required interrupt channel|- The same as M4 core|
|- Call xTaskCreate()<br>- Call vTaskStartScheduler()|- The same as M4 core|

## Workflow of tflite inference process

1. Informed of a new inference task by M4 core.
    1. Triggered by interrupt
    2. Receive model file name with message buffer
2. Get model offset in tflite model file
3. Check the schema version
4. Parse the model and prepare `inference control block`
    - Offset of read-only tensor
    - Tensor_arena size for read-write tensor
    - All model data other than the read-only tensor data should be stored in the inference control block

    ```c
    typedef Model Model_t;
    typedef SubGraph SubGraph_t;
    typedef flatbuffers::Vector<flatbuffers::Offset<Tensor>> Tensors_t;
    typedef flatbuffers::Vector<flatbuffers::Offset<Operator>> Operators_t;
    /* Array of pointers to the operation entry points. */
    typedef void(*(OperationLibrary_t)[])(void);

    typedef struct xNODE_AND_REGISTRATION
    {
        ...
    } NodeAndRegistration_t;

    typedef struct xTF_LITE_CONTEXT
    {
        ...
    } TfLiteContext_t;

    typedef struct xINFERENCE_CONTROL_BLOCK
    {
        /* The offset of the root in flatbuffer format. */
        const char *pcFileName;
        const Model_t *pxModel;
        const SubGraph_t *pxSubGraph;
        const Tensors_t *pxTensors;
        const Operators_t *pxOperators;
        const OperationLibrary_t *pxOperationLibrary;
        NodeAndRegistration_t *pxNodeAndRegistration;
        TfLiteContext_t xContext;
        TfLiteStatus_t xStatus;
    } InferenceControlBlock_t;
    ```
5. Send the interpreter handle or fail
    1. Triggered by interrupt
    2. Send the address of inference control block (interpreter handle) with message buffer
6. Start inference
    1. Triggered by interrupt
    2. Receive the address of inference control block (interpreter handle) with message buffer
    3. Maintain a waiting list
7. Inference detail
    1. Check the status of inference control block
    2. Load the read-only tensor data needed by the first and second operators from the file into memory
    3. For every operator
        1. Get node of this operator
        2. Get registration of this operator
        3. Make sure the read-only tensor data has been loaded from the file into memory, and set the tensor data pointer
        4. Call invoke of this registration

            ```c
            TfLiteStatus_t (*invoke)(TfLiteContext_t *pxContext, TfLiteNode_t *pxNode);
            ```
            1. Get address of parameter structure
            2. Get address of input tensors
            3. Get address of output tensors
            4. According to the implementation
        5. Load the read-only tensor data needed by the operator after the next one from the file into memory

## Workflow of AI application

```c
static void prvAiTask(void *pvParameters)
{
    InterpreterHandle_t xInterpreter;
    TfLiteTensor_t *pxInput, *pxOutput;
    BaseType_t xReturnValue;
    const char pcModelFileName[] = "person_detect.tflite";

    (void) pvParameters;

    /* Ask the inference process to load the model and build its control block. */
    xReturnValue = xPrepareInference(pcModelFileName, &xInterpreter);
    if(!xReturnValue)
    {
        printf("There is something wrong\n");
        while(1);
    }

    /* Fill the first input tensor. */
    pxInput = xGetModelInput(xInterpreter, 0);
    if(!pxInput)
    {
        printf("There is something wrong\n");
        while(1);
    }
    pxInput->data.f[0] = 0.0;

    /* Run the inference. */
    xReturnValue = xInvoke(xInterpreter);
    if(!xReturnValue)
    {
        printf("There is something wrong\n");
        while(1);
    }

    /* Read the first output tensor. */
    pxOutput = xGetModelOutput(xInterpreter, 0);
    if(!pxOutput)
    {
        printf("There is something wrong\n");
        while(1);
    }
    printf("%f\n", pxOutput->data.f[0]);

    /* A FreeRTOS task function must never return. */
    vTaskDelete(NULL);
}
```

## Workflow of inference API

### xInferenceInterfacePrepareInference

1. Use message buffer to send model file name and AI task number to inference process
2. Use message buffer to receive inference control block handle
3. Return the inference control block handle

## Workload

:::info
keras [sample code](https://github.com/super13579/CNN_revolution)
:::

1. Person detection
    - Model size: 236,072 bytes
    - Input image size: 96\*96\*1 bytes
    - TFLite for Microcontrollers benchmark model
2. Image classification
    - Model size:
    - Input image size: 32\*32\*3 bytes
    - MobileNet
3. Image classification
    - Model size:
    - Input image size: 32\*32\*3 bytes
    - ResNet
4. Image classification
    - Model size:
    - Input image size: 32\*32\*3 bytes
    - Evaluation model in CMSIS paper
5. Image classification
    - Model size:
    - Input image size: 32\*32\*3 bytes
    - VGG16

## 1.Hello_World

This example shows how to use the IDE by turning on an on-board LED.

1. Create a project and select the board stm32h745i-disco
2. Set up the [debug configuration](https://github.com/apple11361/Tensorflow_Lite_Inference_on_Edge_Device/blob/master/0.Document/3.Getting%20started%20with%20projects%20based%20on%20dual-core%20STM32H7%20microcontrollers.pdf)
3. Start the CM7 debug session
4. Start the CM4 debug session
5. Run the CM4 program; Domain 2 enters stop mode and the CPU enters the deep-sleep state
6. Run CM7; after initializing the system, CM7 wakes up CM4 and configures the GPIO so that the connected LED lights up
7. Resume CM4

## 2.UART_Example

Practice using the UART through the HAL library. After initializing USART3, the Cortex-M4 prints a string. USART3 is connected to the PC through CN14 (STLK).

1. Use `$ dmesg` to find the COM port; during the experiment it was `/dev/ttyACM0`
2. Connect to the COM port with screen: `$ screen /dev/ttyACM0 <speed>`; the example code uses a UART speed of 115200
3. Cortex-M4 initializes USART3
4. Cortex-M4 sends the string
5. The string should appear in the terminal

## 3.FreeRTOS_Example

- FreeRTOS version: 10.2.1
- Linux COM port tool: screen

Run the FreeRTOS example program on the development board. This example uses FreeRTOS message buffers to pass data between programs running on different CPUs. The program flow is as follows:

1. A task on the Cortex-M7 continuously sends data to two tasks; the data are increasing numbers: "0", "1", "2", "3"...
2. Two tasks on the Cortex-M4 continuously receive the data and print them over UART

![](https://i.imgur.com/TdFmqOy.png)

Procedure:

1. Use `$ dmesg` to find the COM port; during the experiment it was `/dev/ttyACM0`
2. Connect to the COM port with screen: `$ screen /dev/ttyACM0 <speed>`; the example code uses a UART speed of 115200

## 4.TensorflowLite_Example

## Some tips

### Hardware fault

1. Check for stack overflow
2. Check for NULL pointer access
3. Inspect the HFSR (HardFault Status Register) in the SCB (System Control Block)

# FreeRTOS

## Source Organization

```
+-FreeRTOS-Labs    Contains the FreeRTOS-Labs
|
+-FreeRTOS-Plus    Contains FreeRTOS+ components and demo projects
|
+-FreeRTOS         Contains the FreeRTOS real time kernel
    |               files and demo projects
    |
    +-Demo      Contains the demo application projects.
    |
    +-Source    Contains the real time kernel source code.
```

1. The minimal FreeRTOS kernel consists of three files: `tasks.c`, `queue.c` and `list.c`
    1. `tasks.c` manages the tasks in FreeRTOS
    2. `queue.c` manages IPC
    3. `list.c` provides the list data structure used by the system and by application implementations
2. `timers.c`, `croutine.c` and `event_groups.c` can optionally be added; they provide software timers, co-routines and event groups respectively

```
+-FreeRTOS         Contains the FreeRTOS real time kernel source
    |               files and demo projects
    |
    +-Source        Contains the real time kernel source code.
        |
        +-include   The core FreeRTOS kernel header files
        |
        +-Portable  Processor specific code.
            |
            +-Compiler x    All the ports supported for compiler x
            +-Compiler y    All the ports supported for compiler y
            +-MemMang       The sample heap implementations
```

3. Architecture-dependent code lives in `FreeRTOS/Source/Portable/[compiler]/[architecture]`; the code differs for each compiler and hardware architecture
4. For dynamic memory allocation, some sample heap implementations can be found in `MemMang/`

```
FreeRTOS
    |
    +-Demo
        |
        +-Common    The demo application files that are
        |           used by all the demos
        +-Dir x     The demo application build files for port x
        +-Dir y     The demo application build files for port y
```

5. The demos for all hardware share one set of application code, located in `FreeRTOS/Demo/Common/Minimal`
6. The contents of `FreeRTOS/Demo/Common/Full` are old and do not support all hardware ports, only the PC

## Data types and naming conventions

- Variables
    - char: prefixed with c
    - short: prefixed with s
    - long: prefixed with l
    - float: prefixed with f
    - double: prefixed with d
    - Enum variables: prefixed with e
    - portBASE_TYPE or others (e.g. struct): prefixed with x
    - Pointers have an additional prefix p; for example, a pointer to short is prefixed with ps
    - Unsigned variables have an additional prefix u; for example, an unsigned short variable is prefixed with us
- Functions: prefixed with the return type and the name of the file they are defined in
    - vTaskPrioritySet() is a function in tasks.c that returns void
    - xQueueReceive() is a function in queue.c that returns portBASE_TYPE
    - Functions with file scope only are prefixed with prv (private)
- Macro names: macros in FreeRTOS are written in upper case, and the lower-case prefix indicates where the macro is defined
    - portMAX_DELAY: portable.h
    - configUSE_PREEMPTION: FreeRTOSConfig.h
- Common macro return values: pdTRUE and pdPASS are 1, pdFALSE and pdFAIL are 0.

## Task state diagram

![](https://i.imgur.com/SfSSmZr.png)

## Reference

- [NCKU CSIE wiki: FreeRTOS](http://wiki.csie.ncku.edu.tw/embedded/freertos)
- [Modifying a FreeRTOS Demo](https://www.freertos.org/porting-a-freertos-demo-to-different-hardware.html)
- [Creating a New FreeRTOS Project](https://www.freertos.org/Creating-a-new-FreeRTOS-project.html)
- [FreeRTOS coding style & coding standard](https://blog.csdn.net/zhzht19861011/article/details/50057531)

# FlatBuffers

- [FlatBuffers quick start](https://www.itread01.com/p/15985.html)
- [FlatBuffers file structure](https://blog.csdn.net/weixin_42869573/article/details/83820166)
- [FlatBuffers explained in detail](https://blog.csdn.net/yxz329130952/article/details/50880191)
- [FlatBuffers: Use in C](https://google.github.io/flatbuffers/flatbuffers_guide_use_c.html)
- [FlatCC FlatBuffers in C for C](https://github.com/dvidelabs/flatcc)

# Reference list

- Machine Learning in Resource-Scarce Embedded Systems, FPGAs, and End-Devices: A Survey
- CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs
- [Tensorflow lite for microcontrollers](https://www.tensorflow.org/lite/microcontrollers)

# Issues and progress reports

:::info
Progress report
:::

:::warning
Problems encountered and their solutions
:::

:::success
Questions discussed with the advisor and the outcomes
:::

## 2/17

:::info
- Get familiar with the development board and the IDE
- Read the documentation of the development board
:::

:::warning
- Handling the dual-core boot sequence
    1. CM7 waits for CM4 to boot and enter stop mode
    2. Once CM4 has booted and entered stop mode, CM7 takes over and initializes the system
    3. After CM7 finishes the initialization, it wakes up CM4
    4. CM7 waits until CM4 has been woken up and then continues with peripheral initialization
    5. After being woken up, CM4 initializes the system
    6. After initializing the system, CM4 goes straight on to peripheral initialization
    7. Once CM4 and CM7 have each initialized their peripherals, they go straight on to the rest of the code
:::

## 2/24

:::info
- Read the MCU documentation
- An LED example
- A UART example
:::

:::warning
- [Solved the problem of placing a variable at a specific memory address](https://mcuoneclipse.com/2012/11/01/defining-variables-at-absolute-addresses-with-gcc/)
    - ~~pragma location(IAR)~~
    - \_\_attribute\_\_((section()))
    - linker script
- Get familiar with writing [linker scripts](https://blog.louie.lu/2016/11/06/10%E5%88%86%E9%90%98%E8%AE%80%E6%87%82-linker-scripts/)
:::

## 3/4

:::info
- Read FreeRTOS Task management
- Started working on the TensorflowLite example
:::

:::warning
The following problems were encountered while working on the FreeRTOS example:
- Not much time may be spent before `vTaskStartScheduler()`; otherwise, if the scheduler has not finished initializing when the first SysTick fires, an error occurs
    - Disabling interrupts at the very beginning should solve this
- Incorrect ISR initialization
    - Modify `stm32h7xx_it.c`
- Using the UART from both cores at the same time causes errors
    - A cross-core lock mechanism should solve this
- If the CM7 core sends messages too slowly, the CM4 core does not receive them, which leads to an error
    - The error is caused by CM7 printing to the UART too slowly; it has not occurred so far when CM7 does not print to the UART
:::

## 3/11

:::info
- TensorflowLite bare metal example
- Implement `void DebugLog(const char* s)` according to the hardware (not portable)
:::

:::warning
- `Undefined reference to 'vtable xxxxxxxxxx'`
    - In general this error means that a virtual function cannot be referenced
    - Here it was caused by my newly added `*.cc` source files not being compiled
    - Fix: right-click the CM7 project → Properties → C/C++ General → Paths and Symbols → Source Location → Add folder
:::

## 3/18

:::info
- Size of the executables built for each model and operation library
:::

:::warning
- Why the sizes printed by `objdump -h` do not add up to the size of the ELF file
    - [ELF file analysis](https://www.itread01.com/content/1544800701.html)
    - The contents of the symbol table are not listed there
        - They can be inspected with `readelf -S`
    - Besides the contents of each section, the file also contains the file header, program header table and section header table
- HardFault
    - tensor_arena too small: accesses run past the end of the array and corrupt the data located above tensor_arena
    - tensor_arena too large: accesses go beyond the stack and corrupt the data below the stack
    ![](https://i.imgur.com/mA3etAO.png)
    - Changing the linker script to use another RAM region as the stack causes an error during hardware initialization
        - prvSetupHardware()
            - HAL_Init()
                - HAL_NVIC_SetPriorityGrouping(NVIC_PRIORITYGROUP_4)
        - Memory reads back as all zeros, and the error occurs on return ($lr) because the return address is popped from the stack
        - The cause is CM4 entering stop mode: memory is fine before entering stop mode, and afterwards all values become 0 (as observed in the debugger).
        - The fix is to leave the stack configuration unchanged (the stack stays in its original memory region) and only change the memory region that holds tensor_arena (AXI SRAM); do not place tensor_arena on the stack as a local variable.
            - AXI SRAM is mapped at address 0x2400 0000 and accessible by all system masters except BDMA through D1 domain AXI bus matrix. AXI SRAM can be used for application data which are not allocated in DTCM RAM or reserved for graphic objects (such as frame buffers)
        - However, this way the space of tensor_arena is also counted in the executable file
            - [ ] Make it take no space in the executable, like the .bss section
:::

## 3/25

:::info
- Check the type of the model weights
    - All are kTfLiteUInt8
- Check whether the operators use integer or floating-point arithmetic
    - They all use integer arithmetic as well
- person_detect execution time
    - Rough estimate with -O0: about 39 seconds
    - Rough estimate with -O3: about 4 seconds
- Factors that affect execution speed include the number of layers and the number of tensors per layer
    - mobilenet
        - 31 layers
        - Image size 224\*224\*24
    - person_detect
        - 29 layers
        - Image size 96\*96\*8
- Tolerance for inference time: it depends on the edge-device use case. I think latency on the order of minutes is acceptable in some applications, for example traffic prediction or weather forecasting
- Paper reading: CMSIS-NN: Efficient Neural Network Kernels for Arm Cortex-M CPUs
    - SIMD instructions, especially 16-bit Multiply-and-Accumulate (MAC) instructions (e.g. SMLAD)
    - NNFunctions and NNSupportFunctions
- Software architecture diagram
:::

:::success
- Updating the operation library over the network
    - How do we modify the code region, and would the benefit really outweigh the cost?
    - Write the flash directly; the network download feature can serve as the baseline
- The model is also replaced over the network
    - Yes
- Should the file system and network stack live inside the kernel thread, or outside as a separate process?
    - Inside gives better performance
- The workload may contain several mobile models, as long as they target different applications
:::

## 4/1

:::success
- Last time we discussed updating the model over the network. How should the command to update the model be issued?
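    - The server issues the command to the edge device
:::

:::info
- Software architecture diagram
- [Two OS domain](https://www.embedded.com/take-advantage-of-multicore-with-a-multi-os-software-architecture/)

    Because our hardware is an AMP architecture, we have two OS domains. AMP also means that processes cannot migrate between cores, so we must decide which applications run on the M7 core and which run on the M4 core. The final experiments should also check whether the load is unbalanced.
- Why use an AMP architecture

    We did not choose it deliberately; the board we picked simply happened to be built this way. Does this part need a dedicated explanation? The advantages are that no critical-section protection is needed and that a dedicated core can run the important task (inference) all the time. The only justification I can think of is that many applications today use a specialized DSP for compute-intensive work together with a general-purpose CPU (no related paper found yet).
- How to assign applications to cores

    Because this work is mainly about running AI inference, I expect most of the time to be spent on inference, so the inference process should run on the M7 core.

    This process needs a file system, so the OS kernel on the M7 must include one. The support process also uses the file system, so it runs on the M7 core as well. The remaining applications run on the M4 core.

:::success
1. Running the support process on the M7 has one problem: inference cannot run while the model or the operation library is being updated.
    - I/O bound
    - These steps are only needed the first time
2. The task management on the right is mainly used to manage the tflite inference process and the support process.
:::

- File system

    To replace a model while avoiding ==recompilation==, we have to separate the model from the executable, and then we need somewhere to store the model file. Inference reads the file when it needs the model, and a model update writes the file, so we implement a file system for this.

    As shown later, the design has two processes: one is responsible for inference (reads the model) and one is responsible for updating operations and models (reads and writes the model). Both processes use the file system.
    - LittleFS
    - SPIFFS
    - FatFS
    - [Try LittleFS on STM32 and SPI flash](https://blog.csdn.net/yi412/article/details/86654063)

:::success
After discussing with the advisor we decided to use the FAT file system, because wear leveling is less of a concern nowadays and FAT is more common
:::

- Tflite support process

    This process mainly addresses the ==dynamic operation library== problem. With the existing tflite for microcontroller library, the user must work out which operations the model uses and then modify the code by hand to pull in those operations; the only alternative is to pull in all operations at once. The first approach is very inconvenient for developers: they need to know which operations the model uses, and whenever the model changes they must modify the code by hand, recompile and upload again. The second approach increases the memory footprint. This process therefore detects which operations a model uses when the user loads it, and downloads and updates the operation library automatically. In addition, to make application development more convenient, this process can also add or ==replace models on the edge device== by download.
- Tflite inference process

    Apart from the problems above, the original tflite for microcontroller has two more drawbacks. The first concerns memory management: the existing framework requires the developer to ==allocate the memory manually== for the library, and the right size can only be found by trial and error, which is inconvenient. The second is that the existing framework ==cannot run a model that is larger than the memory== of the device. This process therefore handles two things. First, it computes the required memory size automatically when the model is loaded, so the developer neither spends effort on it nor worries about wasting memory. Second, it can run a model larger than the available memory by swapping. Because the model weights are read-only, the swap mechanism only loads different regions of the file into memory and never needs to write anything back to storage.
- [Network stack](https://savannah.nongnu.org/projects/lwip/)

    This is needed to update the model or the operation library. [OTA](https://searchmobilecomputing.techtarget.com/definition/OTA-update-over-the-air-update) is also a very common feature in many edge-computing devices today. Further reading: [Over-the-air updates in embedded microcontroller applications: design trade-offs and lessons learned](https://www.eettaiwan.com/news/article/20190214TA71-over-the-air-ota-updates-in-embedded-microcontroller-applications).
- Our tflite API

    Provide developers with a new API (similar to the original one) that mainly simplifies the development flow and removes some tedious steps.
- Speed of flash as a storage device
    - eMMC (NAND)
    ![](https://i.imgur.com/cWNY4r1.png)
    - QSPI Flash (NOR)
    ![](https://i.imgur.com/DIBexad.png)
    - USB OTG (NAND)
        - FS (12 Mb/s)
:::

## 4/8

:::info
- Workflow of startup
- SDMMC related specifications
    - SD cards are more common, so we decided to use the eMMC interface first
    - The maximum speed can exceed 104 MByte/s
:::

:::success
- [x] Should names follow the original tflite names as closely as possible?
    - For example op_resolver, operation library
    - interpreter, inference control block
    - Not necessarily; use whatever names fit better, and note in the thesis which data structures map one-to-one to tflite
- [x] New name for the support process
    - model updating process
- [x] When to handle the operation library
    - When the model is downloaded
    - [Easily Parse TFLite Models with Python](https://jackwish.net/tflite/)
:::

## 4/15

:::info
- Check whether the timer interrupt is separate for the two CPUs
    - The timer registers are inside the CPU core
- New architecture
    - heap memory management
    - model updating process
    - message buffer service
- FreeRTOS heap memory management
- Shared memory usage
    - Inference control block
    - message buffer
:::

:::success
- [x] Should everything be written in C, even if a third-party package has no C version?
    - flatbuffer
    - gemmlowp
    - tflite operation library
    - Writing our own work in C is enough; the rest stays as it is and is compiled with g++
- [x] Memory management
    - Use the FreeRTOS heap management directly
    - The allocator built into tflite for microcontrollers
    - Decision: use the FreeRTOS heap management directly
:::

## 4/22

:::info
- [x] Check how much space the read-write tensor data takes
![](https://i.imgur.com/07x4xPy.png)
![](https://i.imgur.com/SK3p0gd.png)
![](https://i.imgur.com/svceDjF.png)
:::

## 4/29

:::info
- API prototype
:::

:::success
- [ ] Should the inference control block be allocated dynamically or statically?
    - Can one AI task have two inference control blocks?
- [x] Naming of the data structure
    - Always call it inference control block
- [x] Dependency on FreeRTOS
    - It depends on FreeRTOS
        - API naming style
        - Error numbers
        - Data type
- [x] Folder structure
    - Should the common data structures be placed in two files or share one header file?
        - Share one; it can be placed anywhere
    - Should the tflite service API on the CM4 side live inside or outside FreeRTOS?
        - Keep them separate

```
+-My_Work           The project root directory
    |
    +-Common        Directory and files generated by IDE
    +-Drivers       Directory and files generated by IDE
    +-CM7
        |
        +-Core
            |
            +-Src
            +-Inc
        +-MyWork
            |
            +-flatbuffer
            +-gemmlowp
            +-Src
            +-Inc
        +-FreeRTOS
            |
            +-Src
            +-Inc
            +-port
    +-CM4
        |
        +-Core
        +-FreeRTOS
        +-MyWork
            |
            +-Src
            +-Inc
```
:::

## 5/6

:::info
- M4 generates an interrupt to M7
- Explanation of FlatBuffers
:::

:::success
- [x] Solution for flatbuffer
    - [x] Convert the model on a PC into our own file format first and then use that; this way the flatbuffer library can be used
:::

## 5/13

:::info
- Core-to-core communication with message buffer
:::

## 5/20

:::info
- Use the eMMC card to store data
- [SD protocol](https://www.itread01.com/content/1541934448.html)
- [SD command list](http://www.zeroplus.com.tw/E-paper/200907/image/SD_command%20and%20register%20list.pdf)
:::

:::warning
- The SD card must not be interrupted while receiving or transmitting (setting a debugger breakpoint is not possible either); otherwise FIFO overrun and underrun errors occur
- Location of the RAM buffer (it must be a region the DMA can access)
- The peripheral bus must not be slower than the card
:::

## 6/3

:::info
- Add FatFs
- Rewrite the low-level MMC driver of FatFs
    - For more flexibility, do not use the BSP; implement everything with HAL
    - Change polling to DMA
- Network transfer: LwIP
:::

:::warning
- ==Do not send the next command before the DMA transfer has completed==
:::

## 6/10

:::info
- DHCP works
- ICMP works
- TCP works
:::

:::success
- Should the PC push files to the device, or should the embedded device download them from the PC?
:::

:::warning
- Out-of-order memory access: add `__DSB()` before starting the DMA
:::

## 6/17

:::info
- File download completed
- Model file converter (`*.model`)
:::

## 7/8

:::warning
Pay attention to alignment when writing to the SD card: the DMA requires 32-bit alignment. The driver does not check this because it originally did not write to the SD card via DMA
:::

:::info
- new model format
- inference interface
- flow of two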
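processors
:::

## 7/15

:::info
- The operation init stage may need AllocatePersistentBuffer
- The operation prepare stage may need AllocatePersistentBuffer and RequestScratchBufferInArena
- The operation invoke stage may need GetScratchBuffer
:::

:::warning
- printf with %f crashes because of a bug in the newlib used by STM32CubeIDE

    http://www.nadler.com/embedded/newlibAndFreeRTOS.html
:::

## 7/22

:::info
- Present the execution flow
- Compute the tensor arena size automatically
:::

## 7/29

:::info
Contributions so far
- Models can be uploaded over the network and no longer have to be compiled together with the program, so a model can be updated quickly and conveniently
- A new model format means the model file does not have to be loaded into memory in full. The new format is also slightly smaller than the original tflite model (and the on-board code that processes the model is smaller than the whole flatbuffer library).
- The minimum space needed for read-write tensors is computed and allocated automatically. Previously the user had to test repeatedly to find the minimum memory requirement, recompiling for every test.
- The read-only tensor data is not loaded into memory all at once, which saves space
- The tflite code was rewritten to improve the originally complex data structures (duplicated data in structures); the current architecture also makes it easier to split the work across multiple CPUs
- (It is easier for an embedded device to run multiple tasks) (but tflite was designed from the start to run without an OS, so this point is not a strong selling point and is harder to argue)
:::

:::success
- Algorithm of memory allocation for read-only tensor
    - A memory planner similar to the one for read-write tensors
- Why are read-write tensors not swapped, only read-only tensors?
    - In the end nothing is swapped; allocate/free wastes a lot of time
- Update operation library
    - Future work?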
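:::

:::warning
- Hardware initialization sometimes fails repeatedly; power the board off for a while and try again. The most common symptom is SDMMC initialization hanging. The suspected cause is that a bug in the I/O code affects the SDMMC controller, and without a power cycle (or with one that is too short) the controller is not reset.
:::

:::info
- Because the memory planner lets tensor data overlap, preloading the data of the next layer requires every tensor's lifetime to start one layer earlier, so the planned memory requirement becomes larger.
- FatFs has no non-blocking I/O
:::

## 8/12

:::info
- Implement block I/O
- How to measure memory usage
- Experiment figures
:::

## 8/27

:::info
- The convnet quantized and mobilenet_v2 code has a bug: the filters are uint8 but the I/O tensors are float32
- Newer TensorFlow versions fix some bugs of the old version, but the software architecture has changed as well
    - Scratch buffers are not supported, and the int8 reduce uses them
:::

## 9/9

- Paper outline
- Experiment
    - Figures and tables, workload names
    - Whether to subtract the memory used by FreeRTOS and FatFS
    - Comparison with tflite for microcontroller
    - stmcube.ai
    - onnc

---

## 9/25

- The CM7 task stacks should be placed in DTCM, which is faster
- The CM7 framework data structures cannot be placed in DTCM, because CM4 cannot access it

## 10/3

- The cache must be invalidated after a DMA transfer, but because of alignment the invalidation can also hit other data that has not been written back to memory yet, which causes errors

## 10/6

- Compiler optimization affects the experiment results
- Overall, enabling the cache shortens the execution time
- Whether dynamic loading is used makes little difference; compiler optimization may have a larger effect on the execution time
- Whether pipelined dynamic loading is used makes little difference (from the order of ms to the order of 10 ms); it is only visible for mobilenet, and for the others the execution time is again dominated by compiler optimization

---

- [ ] The model converter does not handle endianness
- [ ] Custom opcodes are not implemented for now
- [ ] The dynamic operation library is not implemented for now
- [ ] The D-Cache is currently disabled
- [ ] Measure the FatFs read speed
- [ ] Use CMSIS
- [ ] Send commands over the network
- [ ] ONNC memory planner
- [x] Find new workloads
- [x] A task should not occupy CPU time while waiting for the DMA to complete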
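
The 10/3 note on cache invalidation and alignment can be illustrated with a small sketch. It assumes the 32-byte D-cache line of the Cortex-M7; the helper names, the buffer name and the 512-byte block size are hypothetical and only serve the example. The idea is to make every DMA buffer start on a cache-line boundary and span whole lines, so that invalidating it after a transfer cannot discard unrelated data that shares a line with the buffer. The helpers are plain C and can be checked on a host PC.

```c
#include <stddef.h>
#include <stdint.h>

/* D-cache line size of the Cortex-M7 in bytes. */
#define DCACHE_LINE_SIZE 32u

/* Round an address down to the start of the cache line that contains it. */
uintptr_t uxCacheLineFloor(uintptr_t uxAddress)
{
    return uxAddress & ~((uintptr_t)DCACHE_LINE_SIZE - 1u);
}

/* Round a byte count up to a whole number of cache lines. */
size_t xCacheLineCeil(size_t xLength)
{
    return (xLength + DCACHE_LINE_SIZE - 1u) & ~((size_t)DCACHE_LINE_SIZE - 1u);
}

/*
 * Return 1 when the buffer starts on a cache-line boundary and covers whole
 * lines only, i.e. invalidating it cannot touch data outside the buffer.
 */
int xDmaBufferIsCacheSafe(const void *pvBuffer, size_t xLength)
{
    uintptr_t uxStart = (uintptr_t)pvBuffer;

    return (uxStart % DCACHE_LINE_SIZE == 0u) && (xLength % DCACHE_LINE_SIZE == 0u);
}

/*
 * Hypothetical receive buffer for one 512-byte block: aligned to a cache line
 * and sized as a multiple of the line size. On the target, the cache would be
 * invalidated for exactly this range after the DMA transfer completes, for
 * example with the CMSIS routine SCB_InvalidateDCache_by_Addr().
 */
__attribute__((aligned(32))) uint8_t ucBlockBuffer[512];
```

A buffer that fails `xDmaBufferIsCacheSafe()` would either need to be copied through an aligned bounce buffer or be cleaned before it is invalidated; which of the two fits this project is left open here.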