(2019/07/26) Deephi 3.0 Operation Guide
===
###### tags: `Xilinx`
- author: Kaiden Yu
- update:
- 190510: initial version
- 190513: added section "D. Deephi Hands-On Examples"
- 190520: added the "dnndk_example/resnet50" subsection to section "E. Deephi Example Code Walkthrough"
- 190523: added the "dnndk_example/segmentation" subsection to section "E. Deephi Example Code Walkthrough"
- 190524: added the "dnndk_example/yolov3" subsection to section "E. Deephi Example Code Walkthrough"
- 190529: added the "Using the framebuffer driver (/dev/fbX, where X is a digit)" subsection to section "E. Deephi Example Code Walkthrough"
- 190625: added the target credentials and how to control the target remotely over ssh
- 190626: added the reference URL for the SSD example
- 190726: added ssh X11 forwarding to the "Setting up the target" subsection of section "C. Deephi Environment Setup"
- [reference: SSD example code](https://github.com/Xilinx/Edge-AI-Platform-Tutorials/tree/master/docs/ML-SSD-PASCAL/ref_training)
---
### A. Test Environment
- host: Ubuntu 16.04
- target: ZCU104 stock evaluation board
---
### B. Preliminaries
#### Xilinx website
- https://www.xilinx.com/products/design-tools/ai-inference/ai-developer-hub.html#edge
- contains the relevant documentation and files
- unless noted otherwise, everything below is downloaded from this site
---
#### Downloads
1. DNNDK: xlnx_dnndk_v3.0_190430.tar.gz
2. Linux image: xilinx-zcu104-prod-dpu1.4-desktop-buster-2019-04-23.img.zip
3. flash tool: balena-etcher-electron
    - download: https://www.balena.io/etcher/ (see ug1327-dnndk-user-guide.pdf p.15)
---
#### Downloaded file descriptions
1. DNNDK: contains three parts
    1. DNNDK
    2. Xilinx AI SDK
    3. README file: **read this first**
2. Linux image: the boot files plus a Linux OS with AI support
3. flash tool: writes the Linux image above onto an SD card
---
#### flash tool - balena

- procedure (see ug1327 pp.15-17)
    1. select the image (the Linux image from the Downloads section above; no need to unzip it after downloading)
    2. select the SD card
    3. run the flash
---
### C. Deephi Environment Setup
#### Setting up the target

- current setup (see ug1327 p.8)
    - insert the flashed SD card into the card slot
    - video output goes through DisplayPort, not HDMI
    - plug a hub into the USB port, then a keyboard and mouse, and operate the board directly; no UART connection to the host is needed
    - credentials: root/root
- Ethernet is wired
    - click the network icon at the bottom right
    - you will see the IP is 192.168.200.XXX -> wrong
    - click disconnect and then connect
    - the IP becomes 192.168.10.XXX -> correct
    - after these steps the network is fine
- the host can ping the target over either wired or wireless networking (so any host can scp files to the target)
- the target can be controlled remotely over ssh
```
$ ssh root@192.168.10.XXX
```
- ssh can be set up with X11 forwarding, so OpenCV imshow windows from programs on the target can be viewed directly on the host (this does not work when the display goes through the framebuffer). Add the following three lines at the bottom of /etc/ssh/ssh_config ([reference](http://www.ubuntu-tw.org/modules/newbb/viewtopic.php?viewmode=compact&topic_id=52608&forum=22))
```
ForwardAgent yes
ForwardX11 yes
ForwardX11Trusted yes
```

- the target can only ping hosts on the wired network (but this does not matter)
---
#### DNNDK(host)
- install Caffe 1.0 (look it up online)
- install Caffe's dependencies (see ug1327 p.8) with the command below
```
sudo apt-get install -y --force-yes build-essential autoconf libtool libopenblas-dev libgflags-dev libgoogle-glog-dev libopencv-dev protobuf-compiler libleveldb-dev liblmdb-dev libhdf5-dev libsnappy-dev libboost-all-dev libssl-dev
```

- run the install shell script (see ug1327 p.9)
```
sudo ./install.sh ZCU104
```
- the installer output shows that decent only supports CUDA 8.0, 9.0 and 9.1; the graphics card in my machine is too low-end to install it, so only the CPU version could be installed, whose command is decent-cpu
- DECENT version check (see ug1327 pp.21-22)
```
decent-cpu --version
```

- DNNC version check (see ug1327 pp.21-22)
```
dnnc --version
```

---
#### DNNDK(target)
- copy the ZCU104 folder from DNNDK onto the SD card's file system, e.g. /home/linaro/ZCU104 (see ug1327 pp.19-20)

- if the network is up, you can scp it to the board (assuming the board's IP is 192.168.10.147)
```
scp -r ./ZCU104 root@192.168.10.147:~/
```
- transferring a directory and its contents needs "-r" (r for recursive); single files do not
- the command asks yes/no: answer yes
- it asks for the board's root password: enter root
- everything lands under /root on the board
- after powering on the board, cd into ZCU104 and run install.sh (see ug1327 pp.21-22)
```
./install.sh
```
---
#### XILINX_AI_SDK(host)

- set up the host environment (see ug1354 pp.7-8)
1. enter the XILINX_AI_SDK folder
2. run petalinux_sdk.sh (do NOT use sudo!!!)
```
./petalinux_sdk.sh
```

- the installer output shows the default install path is /opt/petalinux/2018.2
- first make sure the owner of /opt is not root
- if /opt is owned by root, chown it to your user (the user on my machine is cme)
```
sudo chown -R cme:cme /opt
```

3. source environment-setup-aarch64-xilinx-linux from the petalinux install (do NOT use sudo!!!)
- **this must be repeated every time you reboot or open a new terminal!!! it is a script**
```
. /opt/petalinux/2018.2/environment-setup-aarch64-xilinx-linux
```
- after installing petalinux, environment-setup-aarch64-xilinx-linux is not executable; change that with chmod first
```
sudo chmod 777 environment-setup-aarch64-xilinx-linux
```

4. cd into the ZCU104 folder and install the Xilinx AI SDK (do NOT use sudo!!! and step 3 must be done first, because it exports some paths)
```
cd ZCU104
./INSTALL_XILINX_AI_SDK.sh
```
- cross compiling the examples (see ug1354 p.8)
1. enter the example code folder
```
cd /opt/petalinux/2018.2/sysroots/aarch64-xilinx-linux/usr/share/XILINX_AI_SDK/samples/classification
```
2. run build.sh to compile
```
sh build.sh
```
---
#### XILINX_AI_SDK(target)
- set up the target environment (see ug1354 pp.8-9)
1. enter the SDK's board folder (note the release names it ZCU102)
```
cd XILINX_AI_SDK-V1.0.0-BUILD-11-2019-04-26/ZCU102
```
2. copy XILINX_AI_SDK-ZCU102.tar.gz onto the SD card

- if the network is up, you can scp it to the board (assuming the board's IP is 192.168.10.147)
```
scp XILINX_AI_SDK-ZCU102.tar.gz root@192.168.10.147:~/
```
3. insert the SD card into the board, boot it, then extract on the board
```
tar zxvf XILINX_AI_SDK-ZCU102.tar.gz -C /
```
**!!! IMPORTANT !!!**: this step is required even when cross compiling; otherwise the executables will not run
- compiling the examples directly on the target (see ug1354 p.9)
1. cd into the example you want to test
```
cd /usr/share/XILINX_AI_SDK/samples/facedetect
```
2. run build.sh to compile
```
sh build.sh
```
- some examples come precompiled, so this step can be skipped for them
3. run the example
```
./test_jpeg_facedetect_dense_box_640x360 sample_facedetect_dense_box_640x360.jpg
```
- each example folder builds several executables; see the readme for how to run each one
---
### D. Deephi Hands-On Examples
#### DNNDK tools hands-on
- decent (see ug1327 pp.35-36)
    - practice with the resnet50 model under decent's host folder (xlnx_dnndk_v3.0_190430/xilinx_dnndk_v3.0/host_x86/models/caffe/resnet50)
    - this tool compresses the model; it mainly does two things
        - quantization: floating point to fixed point
        - pruning: removing unimportant nodes (the freely downloadable release does not support this part; it is paid)
    - prepare the Caffe-trained model files (there are two)
    - prepare the calibration data
        - 100~1000 images plus a txt file listing each image's file name and index
        - here the files provided by FAE Alex were used
    - edit source and root_folder in the .prototxt of the trained model
    - run decent.sh (read the script's contents)
        - the script automatically decides whether to use decent (GPU build) or decent-cpu (CPU build)
        - the script already sets all the decent parameters; see the documentation for their meaning
```
./decent.sh
```
    - when it finishes, a new decent_output folder appears containing deploy.caffemodel and deploy.prototxt, the two compressed model files
- dnnc (see ug1327 pp.38-39)
    - just run dnnc.sh (read the script's contents)
        - the script automatically decides which target it is for (ZCU102, ZCU104, Ultra96)
        - the script already sets all the dnnc parameters; see the documentation for their meaning
```
./dnnc.sh
```
    - the name "resnet50" after "Compiling Network" in the output is the value of the net variable in dnnc.sh; the kernel name is that name plus an underscore and an index -> "resnet50_0"
    - a new dnnc_output folder appears, containing dpu_resnet50_0.elf (machine code for the DPU) and resnet50_kernel_graph.gv
    - why is there no dpu_resnet50_1.elf? in the output, kernel id 1 has type CPUKernel, meaning the DPU does not support that part, so no DPU machine code is generated for it; it has to be written by hand in C code inside the app
- converting the elf to an so, and the configure file (see ug1354 pp.52-53)
    - elf to so
        - the goal is to turn the dpu_resnet50_0.elf above into an .so
        - done with a homemade shell script (elt_to_so.sh)
```
DIR=/home/Ultrascale_plus_MPSOC/Deephi_3.0/xlnx_dnndk_v3.0_190430/xilinx_dnndk_v3.0/host_x86/models/caffe/kaiden_resnet50/dnnc_output
MODEL_NAME=resnet50
aarch64-xilinx-linux-g++ -nostdlib -fPIC -shared \
${DIR}/dpu_${MODEL_NAME}*.elf \
-o ${DIR}/libdpumodel${MODEL_NAME}.so || touch ${DIR}/libdpumodel${MODEL_NAME}.so
```
        - when converting on the host, use the cross compiler command aarch64-xilinx-linux-g++ in the script
        - when converting on the target, use plain g++ instead
        - DIR is the path of the dnnc_output folder
        - MODEL_NAME must be identical to the "net" variable in dnnc.sh
        - the final output is libdpumodelresnet50.so
        - **the .so file name matters: get it wrong and the app will not run**
    - configure file
        - on the target, look at the files under /etc/XILINX_AI_SDK.conf.d/, copy one and edit it, leaving the result in /etc/XILINX_AI_SDK.conf.d
        - on the host, if petalinux_sdk.sh was installed to its default path, XILINX_AI_SDK.conf.d is at /opt/petalinux/2018.2/sysroots/aarch64-xilinx-linux/etc/XILINX_AI_SDK.conf.d
        - what this step actually does is still not entirely clear to me
---
#### Xilinx AI SDK example code hands-on (host or target)
- example tree
    - everything except dnndk_example is an implementation of the abstraction layer (AI API); those programs are only a few lines long and do not fit Chimei's needs
- dnndk_example tree
    - the resnet50 example was chosen because dnndk only provides the resnet and inception models for practicing decent and dnnc; to verify the decent/dnnc flow and the generated .so, this example confirms the .so really is executable
- modifying and running the "resnet50" example
    - following on from the previous section, put the generated libdpumodelresnet50.so into /usr/lib/ on the target
    - open test_dnndk_resnet50.cpp in your editor of choice
    - you will see #define KERNEL_CONV "resnet_50"
    - this must match the kernel name in the messages printed when dnnc.sh runs, so change it to "resnet50_0"
    - save the change and run build.sh
```
sh build.sh
```
- when cross compiling on the host, remember to source environment-setup-aarch64-xilinx-linux first, or it will fail
```
. /opt/petalinux/2018.2/environment-setup-aarch64-xilinx-linux
```
---
### E. Deephi Example Code Walkthrough
- code fragments appear in the same top-to-bottom order as the actual source
- the explanation sits below each fragment
- for convenience, lines are numbered per fragment; these numbers do not correspond to the real source line numbers
- an explanation may reference a fragment line as **${number}:**; a range uses two numbers separated by **"~"**
---
#### dnndk_example/resnet50

- test_dnndk_resnet50.cpp
```c=1
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <atomic>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <queue>
#include <mutex>
#include <string>
#include <vector>
#include <thread>
#include <mutex>
#include <dnndk/dnndk.h>
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;
using namespace std::chrono;
int threadnum;
mutex mutexshow;
#define KERNEL_CONV "resnet_50"
#define CONV_INPUT_NODE "conv1"
#define CONV_OUTPUT_NODE "fc1000"
const string baseImagePath = "./image/";
```
- 33: the kernel name; it must match the kernel name printed in dnnc's messages
- 34~35: these node names can also be found in dnnc's messages
```c=1
/* kaiden: switches between just running the function and running it while
   also measuring its execution time (enable #define SHOWTIME to do so) */
//#define SHOWTIME
#ifdef SHOWTIME
#define _T(func) \
{ \
auto _start = system_clock::now(); \
func; \
auto _end = system_clock::now(); \
auto duration = (duration_cast<microseconds>(_end - _start)).count(); \
string tmp = #func; \
tmp = tmp.substr(0, tmp.find('(')); \
cout << "[TimeTest]" << left << setw(30) << tmp; \
cout << left << setw(10) << duration << "us" << endl; \
}
#else
#define _T(func) func;
#endif
```
- The **#ifdef #else** around **#define SHOWTIME** switches the macro **#define _T(func) func**: with SHOWTIME defined, _T(func) also measures and prints the execution time; otherwise it simply runs func (see the illustrative expansion below)
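- As a concrete example, with SHOWTIME defined a line like `_T(dpuRunTask(taskConv));` expands roughly to the block below (an illustrative expansion relying on the sample's own includes and using-directives, not code from the sample):
```cpp
// With SHOWTIME defined, _T(dpuRunTask(taskConv)); becomes roughly:
{
    auto _start = system_clock::now();
    dpuRunTask(taskConv);                  // the wrapped call runs unchanged
    auto _end = system_clock::now();
    auto duration = (duration_cast<microseconds>(_end - _start)).count();
    string tmp = "dpuRunTask(taskConv)";   // #func stringizes the argument
    tmp = tmp.substr(0, tmp.find('('));    // keep only "dpuRunTask"
    cout << "[TimeTest]" << left << setw(30) << tmp;
    cout << left << setw(10) << duration << "us" << endl;
}
// Without SHOWTIME, the same line is simply: dpuRunTask(taskConv);
```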
```c=1
void ListImages(string const &path, queue<string> &images) {
struct dirent *entry;
/*Check if path is a valid directory path. */
struct stat s;
lstat(path.c_str(), &s);
if (!S_ISDIR(s.st_mode)) {
fprintf(stderr, "Error: %s is not a valid directory!\n", path.c_str());
exit(1);
}
DIR *dir = opendir(path.c_str());
if (dir == nullptr) {
fprintf(stderr, "Error: Open %s path failed.\n", path.c_str());
exit(1);
}
while ((entry = readdir(dir)) != nullptr) {
if (entry->d_type == DT_REG || entry->d_type == DT_UNKNOWN) {
string name = entry->d_name;
string ext = name.substr(name.find_last_of(".") + 1);
if ((ext == "JPEG") || (ext == "jpeg") || (ext == "JPG") || (ext == "jpg") ||
(ext == "PNG") || (ext == "png")) {
images.push(name);
}
}
}
closedir(dir);
}
```
- this function scans a directory for image files
- 5: [reference for struct stat](http://c.biancheng.net/cpp/html/326.html)
- 6: [reference for lstat()](https://blog.csdn.net/yzy1103203312/article/details/77878486); here path is baseImagePath, declared at global scope
- 7~10: check whether it is a directory; exit if not
- 12: open the directory
- 18~27: walk every file in the directory, store the file name in string name and the extension in ext; if the extension is JPEG, jpeg, JPG, jpg, PNG or png, push the file name into images
- 29: close the directory (a C++17 alternative is sketched below)
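- For comparison only, a shorter equivalent using C++17 std::filesystem (a sketch assuming a C++17-capable toolchain; the sample itself uses the POSIX dirent API above, and ListImagesFs is a hypothetical name):
```cpp
#include <cctype>
#include <filesystem>
#include <queue>
#include <string>

// Hypothetical C++17 rewrite of ListImages(): push the file names
// (not full paths) of JPEG/PNG files found directly under `path`.
void ListImagesFs(const std::string& path, std::queue<std::string>& images) {
    namespace fs = std::filesystem;
    for (const auto& entry : fs::directory_iterator(path)) {
        if (!entry.is_regular_file()) continue;
        std::string ext = entry.path().extension().string();  // includes the dot
        for (char& ch : ext)
            ch = static_cast<char>(std::tolower(static_cast<unsigned char>(ch)));
        if (ext == ".jpeg" || ext == ".jpg" || ext == ".png")
            images.push(entry.path().filename().string());
    }
}
```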
```c=1
void LoadWords(string const &path, vector<string> &kinds) {
kinds.clear();
fstream fkinds(path);
if (fkinds.fail()) {
fprintf(stderr, "Error : Open %s failed.\n", path.c_str());
exit(1);
}
string kind;
while (getline(fkinds, kind)) {
kinds.push_back(kind);
}
fkinds.close();
}
```
- this function is not used
```c=1
void TopK(const float *d, int size, int k, vector<string> &vkinds, string name) {
assert(d && size > 0 && k > 0);
priority_queue<pair<float, int>> q;
for (auto i = 0; i < size; ++i) {
q.push(pair<float, int>(d[i], i));
}
cout << "\nLoad image: " << name << endl;
for (auto i = 0; i < k; ++i) {
pair<float, int> ki = q.top();
printf("[Top %d] prob = %-8f name = %s\n", i, d[ki.second], vkinds[ki.second].c_str());
q.pop();
}
return;
}
```
- prints the k most probable classification results
- 3: declare a priority_queue **q** of pair<float, int>
- 5~7: push size pairs (d[i], i) into q
- 11~15: pop the top k pairs of q into ki one by one and print their information
```c=1
vector<string> kinds; // Storing the label of Resnet_50
queue<string> images; // Storing the list of images
void run_resnet_50(DPUTask *taskConv, const Mat &img, string imname) {
assert(taskConv );
// Get the number of category in Resnet_50
int channel = dpuGetOutputTensorChannel(taskConv, CONV_OUTPUT_NODE);
// Get the scale of classification result
float scale = dpuGetOutputTensorScale(taskConv, CONV_OUTPUT_NODE);
vector<float> smRes (channel);
int8_t* fcRes;
// Set the input image to the DPU task
_T(dpuSetInputImage2(taskConv, CONV_INPUT_NODE, img));
// Processing the classification in DPU
_T(dpuRunTask(taskConv));
// Get the output tensor from DPU in DPU INT8 format
DPUTensor* dpuOutTensorInt8 = dpuGetOutputTensorInHWCInt8(taskConv, CONV_OUTPUT_NODE);
// Get the data pointer from the output tensor
fcRes = dpuGetTensorAddress(dpuOutTensorInt8);
// Processing softmax in DPU with batchsize=1
_T(dpuRunSoftmax(fcRes, smRes.data(), channel, 1, scale));
mutexshow.lock();
// Show the top 5 classification results with their label and probability
_T(TopK(smRes.data(), channel, 5, kinds, imname));
mutexshow.unlock();
}
```
- 2: images is ListImages()'s second argument; it is also used in classifyEntry()
- 6: get the channel dimension of the DPU Task's output tensor
- 8: get the scale value of the DPU Task's output tensor; each DPU output tensor has one unified scale value carrying its quantization information for converting between the INT8 and FP32 data types
- 12: set the DPU Task's input image without specifying mean values
- 14: launch the running of the DPU Task
- 16: get the DPU Task's output tensor, stored in DPU order (height/width/channel, i.e. HWC) in INT8 format
- 18: get the start address of the DPU tensor and store it in fcRes (type int8_t*)
- 20: run softmax; the DPU does not support softmax, so it has to be done in software, and this release already wraps it in the function dpuRunSoftmax, saving us from writing it ourselves (see the sketch below)
- 21: lock the mutex so that multiple threads do not race on the shared console output
- 23: show the top-5 classification results
- 24: unlock the mutex
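- What dpuRunSoftmax computes for batchsize = 1 amounts to dequantize-then-softmax; a minimal self-contained sketch (illustrative only, not the library's actual implementation, and softmaxInt8 is a hypothetical name):
```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Illustrative CPU equivalent of dpuRunSoftmax(fcRes, smRes.data(), channel, 1, scale):
// dequantize the INT8 logits with the tensor's scale, then apply a
// numerically stable softmax.
std::vector<float> softmaxInt8(const int8_t* logits, int channel, float scale) {
    std::vector<float> out(channel);
    for (int i = 0; i < channel; ++i) out[i] = logits[i] * scale;  // INT8 -> FP32
    float maxv = *std::max_element(out.begin(), out.end());       // for stability
    float sum = 0.f;
    for (float& v : out) { v = std::exp(v - maxv); sum += v; }
    for (float& v : out) v /= sum;                                // normalize to 1
    return out;
}
```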
```c=1
void classifyEntry(DPUKernel *kernelconv) {
ListImages(baseImagePath, images); // Load the list of images to be classified
if (images.size() == 0) {
cerr << "\nError: Not images exist in " << baseImagePath << endl;
return;
} else {
cout << "total image : " << images.size() << endl;
}
thread workers[threadnum];
auto _start = system_clock::now();
int size = images.size();
for (auto i = 0; i < threadnum; i++)
{
workers[i] = thread([&,i]() {
// Create DPU Tasks for Resnet_50 from DPU Kernel
DPUTask *taskconv = dpuCreateTask(kernelconv, 0);
while(true){
string imageName = images.front();
if(imageName == "")
break;
images.pop();
Mat image = imread(baseImagePath + imageName);
// Classifying single image
run_resnet_50(taskconv, image, imageName);
}
// Destroy DPU Tasks & free resources
dpuDestroyTask(taskconv);
});
}
// Processing multi-thread classification
for (auto &w : workers) {
if (w.joinable()) w.join();
}
auto _end = system_clock::now();
auto duration = (duration_cast<microseconds>(_end - _start)).count();
cout << "[Time]" << duration << "us" << endl;
cout << "[FPS]" << size*1000000.0/duration << endl;
}
```
- 2: call ListImages() to load the image file names from the baseImagePath folder into images
- 10: declare the thread array workers[threadnum]
- 13~31: define each thread's work: pop a file name off images, imread it into Mat image, then call run_resnet_50; once every listed image has been processed it breaks, then runs dpuDestroyTask (note the sample calls images.front()/pop() from several threads without holding a lock; only the printing is protected by mutexshow, so strictly speaking this is a data race)
- 34~36: join the threads, i.e. wait for each thread to finish and reap it
- thread comes from the C++ standard library while DPUTask comes from the DPU's own library; here the two are mixed, one thread paired with one DPUTask
- [C++11 STL Thread](https://kheresy.wordpress.com/2012/07/06/multi-thread-programming-in-c-thread-p1/); the thread header must be included
```c=1
void readTxt(string file)
{
ifstream infile;
infile.open(file.data());
assert(infile.is_open());
string s;
while(getline(infile,s)){
kinds.emplace_back(s);
}
infile.close();
}
```
- this function reads the contents of word_list.txt
- word_list.txt stores the English string of every class
- 9: emplace_back is a C++ vector method, an optimized variant of push_back
```c=1
int main(int argc ,char** argv) {
setenv("DPU_COMPILATIONMODE","1",1);
threadnum = 5; //The number of thread with the highest efficiency
readTxt("./word_list.txt"); //Importing the Res50 label text
/* The main procress of using DPU kernel begin. */
DPUKernel *kernelConv;
dpuOpen();
// Create the kernel for Resnet_50
kernelConv = dpuLoadKernel(KERNEL_CONV);
// The main classification function
classifyEntry(kernelConv);
// Destroy the kernel of Resnet_50 after classification
dpuDestroyKernel(kernelConv);
dpuClose();
/* The main procress of using DPU kernel end. */
return 0;
}
```
- 3: copy as-is
- 5: call readTxt() to load the contents of word_list into kinds (a vector)
- 8: open the DPU
- 10: load the DPU kernel (i.e. the DPU machine code)
- 12: call classifyEntry()
- 14: destroy the DPU kernel and release its associated resources
- 15: close the DPU
---
#### dnndk_example/segmentation

- test_dnndk_segmentation.cpp
```c=1
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <atomic>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <queue>
#include <mutex>
#include <string>
#include <vector>
#include <thread>
#include <mutex>
#include <dnndk/dnndk.h>
#include <opencv2/opencv.hpp>
using namespace cv;
using namespace std;
using namespace std::chrono;
#define KERNEL_CONV "fpn_deconv"
#define CONV_INPUT_NODE "conv1_7x7_s2"
#define CONV_OUTPUT_NODE "toplayer_p2"
uint8_t colorB[] = {128, 232, 70, 156, 153, 153, 30, 0, 35, 152,
180, 60, 0, 142, 70, 100, 100, 230, 32};
uint8_t colorG[] = {64, 35, 70, 102, 153, 153, 170, 220, 142, 251,
130, 20, 0, 0, 0, 60, 80, 0, 11};
uint8_t colorR[] = {128, 244, 70, 102, 190, 153, 250, 220, 107, 152,
70, 220, 255, 0, 0, 0, 0, 0, 119};
// comparison algorithm for priority_queue
class Compare {
public:
bool operator()(const pair<int, Mat> &n1, const pair<int, Mat> &n2) const {
return n1.first > n2.first;
}
};
// input video
VideoCapture video;
// flags for each thread
bool is_reading = true;
bool is_running_1 = true;
bool is_running_2 = true;
bool is_displaying = true;
queue<pair<int, Mat>> read_queue; // read queue
priority_queue<pair<int, Mat>, vector<pair<int, Mat>>, Compare> display_queue; // display queue
mutex mtx_read_queue; // mutex of read queue
mutex mtx_display_queue; // mutex of display queue
int read_index = 0; // frame index of input video
int display_index = 0; // frame index to display
```
- 32~37: this segmentation network can recognize 19 different classes; these three arrays hold the OSD color of each class, e.g. colorB[1], colorG[1], colorR[1] are the B, G, R of the second class
- 38~43: Compare orders the pairs by frame index, so the priority_queue keeps the smallest (earliest) frame index at top(), which lets frames be displayed in order
- 48: plain OpenCV, nothing to add
- 51~54: the individual status flags
- 56: read_queue: the queue holding the frames that have been read in
- 57: display_queue: the queue holding the frames waiting to be displayed (look up priority_queue usage)
- 58: mutex of the read queue; the read thread and the two DPU task threads contend for the read queue, hence this mutex
- 59: mutex of the display queue; the display thread and the two DPU task threads contend for the display queue, hence this mutex
- 60: read index
- 61: display index
```c=1
/**
* @brief entry routine of segmentation, and put image into display queue
*
* @param task - pointer to Segmentation Task
* @param is_running - status flag of the thread
*
* @return none
*/
void runSegmentation(DPUTask *task, bool &is_running) {
// initialize the task's parameters
DPUTensor *conv_in_tensor = dpuGetInputTensor(task, CONV_INPUT_NODE);
int inHeight = dpuGetTensorHeight(conv_in_tensor);
int inWidth = dpuGetTensorWidth(conv_in_tensor);
DPUTensor *conv_out_tensor = dpuGetOutputTensor(task, CONV_OUTPUT_NODE);
int outHeight = dpuGetTensorHeight(conv_out_tensor);
int outWidth = dpuGetTensorWidth(conv_out_tensor);
int8_t *outTensorAddr = dpuGetTensorAddress(conv_out_tensor);
// Run detection for images in read queue
while (is_running) {
// Get an image from read queue
int index;
Mat img;
mtx_read_queue.lock();
if (read_queue.empty()) {
mtx_read_queue.unlock();
if (is_reading) {
continue;
} else {
is_running = false;
break;
}
} else {
index = read_queue.front().first;
img = read_queue.front().second;
read_queue.pop();
mtx_read_queue.unlock();
}
// Set image into CONV Task with mean value
dpuSetInputImage2(task, (char *)CONV_INPUT_NODE, img);
// Run CONV Task on DPU
dpuRunTask(task);
Mat segMat(outHeight, outWidth, CV_8UC3);
Mat showMat(inHeight, inWidth, CV_8UC3);
for (int row = 0; row < outHeight; row++) {
for (int col = 0; col < outWidth; col++) {
int i = row * outWidth * 19 + col * 19;
auto max_ind = max_element(outTensorAddr + i, outTensorAddr + i + 19);
int posit = distance(outTensorAddr + i, max_ind);
segMat.at<Vec3b>(row, col) = Vec3b(colorB[posit], colorG[posit], colorR[posit]);
}
}
// resize to original scale and overlay for displaying
resize(segMat, showMat, Size(inWidth, inHeight), 0, 0, INTER_NEAREST);
for (int i = 0; i < showMat.rows * showMat.cols * 3; i++) {
img.data[i] = img.data[i] * 0.4 + showMat.data[i] * 0.6;
}
// Put image into display queue
mtx_display_queue.lock();
display_queue.push(make_pair(index, img));
mtx_display_queue.unlock();
}
}
```
- 11~13: get the input tensor, then its height and width
- 15~17: get the output tensor, then its height, width and address
- 21~69: loop as long as is_running == true
- 25~33: lock the mutex first; if read_queue is empty, unlock and check is_reading: if true, restart the while loop, otherwise set is_running to false and leave the loop
- 34~39: if read_queue has an entry (a video frame), save the front element (the earliest frame pushed), i.e. keep its index and Mat for the code below, pop it, then unlock the mutex
- 42: hand the saved Mat to the DPU
- 45: start the DPU run
- 47~48: segMat receives the DPU output; showMat is for display
- 49~56: think of the DPU output tensor as a 3-D block: width outWidth, height outHeight, depth 19; width/height give the position and the 19 depth entries hold the per-class scores; max_element returns the address of the largest of the 19 scores, and distance(outTensorAddr + i, max_ind) gives the class index (essentially the subtraction max_ind - (outTensorAddr + i)); the color of that class is then painted onto segMat
- 52: Q: why + 19 rather than + 18? Aren't 0~18 already 19 entries? A: max_element takes a half-open iterator range [first, last), so last is one past the end and (base, base + length) is correct
- 59: resize segMat and store the result in showMat
- 60~62: blend showMat with the frame grabbed from the video (img.data) in fixed proportions; displaying showMat alone would show only a wall of colors with none of the original outlines (cars, roads, people, trees), so 0.4 and 0.6 are just one reasonable choice; to make the outlines clearer, raise 0.4 and lower 0.6 (their sum should still be 1); note the result ends up in img, not showMat (an equivalent one-liner is sketched below)
- 65~67: lock the mutex, push img into the queue, unlock the mutex
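- For reference, the manual byte loop in lines 60~62 could equally be written with OpenCV's addWeighted; a sketch assuming img and showMat share size and type (overlaySegmentation is a hypothetical helper name):
```cpp
#include <opencv2/opencv.hpp>

// Same overlay as the manual loop: img = img * 0.4 + showMat * 0.6,
// with the result written back into img.
void overlaySegmentation(cv::Mat& img, const cv::Mat& showMat) {
    CV_Assert(img.size() == showMat.size() && img.type() == showMat.type());
    cv::addWeighted(img, 0.4, showMat, 0.6, 0.0, img);
}
```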
```c=1
/**
* @brief Read frames into read queue from a video
*
* @param is_reading - status flag of Read thread
*
* @return none
*/
void Read(bool &is_reading) {
while (is_reading) {
Mat img;
if (read_queue.size() < 30) {
if (!video.read(img)) {
cout << "Finish reading the video." << endl;
is_reading = false;
break;
}
mtx_read_queue.lock();
read_queue.push(make_pair(read_index++, img));
mtx_read_queue.unlock();
} else {
usleep(20);
}
}
}
```
- 9~23: loop while is_reading == true
- 11: when the queue holds fewer than 30 frames
- 12~19: read the video; if it has ended, break; otherwise lock the mutex, push the frame just read into the queue, then unlock
- 20~21: when the queue already holds 30 or more frames, usleep
```c=1
/**
* @brief Display frames in display queue
*
* @param is_displaying - status flag of Display thread
*
* @return none
*/
void Display(bool &is_displaying) {
while (is_displaying) {
mtx_display_queue.lock();
if (display_queue.empty()) {
if (is_running_1 || is_running_2) {
mtx_display_queue.unlock();
usleep(20);
} else {
is_displaying = false;
break;
}
} else if (display_index == display_queue.top().first) {
// Display image
imshow("FPN Segmentaion", display_queue.top().second);
display_index++;
display_queue.pop();
mtx_display_queue.unlock();
if (waitKey(1) == 'q') {
is_reading = false;
is_running_1 = false;
is_running_2 = false;
is_displaying = false;
break;
}
} else {
mtx_display_queue.unlock();
}
}
}
```
- 9: loop while is_displaying == true
- 10: lock the mutex
- 11~18: if the display queue is empty and at least one of is_running_1 / is_running_2 is true, unlock and usleep; if both are false, set is_displaying to false and break (both being false means inference has stopped, so there is nothing left to display and the program is basically ending)
- 19~31: if the display queue has frames and display_index equals the index of the frame at the top of the queue, display it
- 21~23: imshow, increment display_index, pop the top of the display queue, then unlock
- 25~30: pressing q at any point sets every flag to false and breaks, i.e. ends the program
- 32: if the display queue is not empty but display_index does not match the index of the top frame, just unlock and try again
```c=1
int main(int argc, char **argv) {
// Check args
if (argc != 2) {
cout << "Usage of segmentation demo: ./segmentaion file_name[string]" << endl;
cout << "\tfile_name: path to your video file" << endl;
return -1;
}
setenv("DPU_COMPILATIONMODE", "1", 1);
dpuOpen();
// DPU Kernels/Tasks for runing SSD
DPUKernel *kernel;
// Create DPU Kernels and Tasks for CONV Nodes in SSD
kernel = dpuLoadKernel(KERNEL_CONV);
vector<DPUTask *> task(2);
generate(task.begin(), task.end(), std::bind(dpuCreateTask, kernel, 0));
// Initializations
string file_name = argv[1];
cout << "Detect video: " << file_name << endl;
video.open(file_name);
if (!video.isOpened()) {
cout << "Failed to open video: " << file_name;
return -1;
}
// Run tasks for SSD
array<thread, 4> threads = {thread(Read, ref(is_reading)),
thread(runSegmentation, task[0], ref(is_running_1)),
thread(runSegmentation, task[1], ref(is_running_2)),
thread(Display, ref(is_displaying))};
for (int i = 0; i < 4; ++i) {
threads[i].join();
}
// Destroy DPU Tasks and Kernels and free resources
for_each(task.begin(), task.end(), dpuDestroyTask);
dpuDestroyKernel(kernel);
// Detach from DPU driver and release resources
dpuClose();
video.release();
return 0;
}
```
- 3~7: if the command plus its arguments do not add up to 2, print the correct usage
- 9: copy as-is
- 10: open the DPU
- 14: load the DPU kernel (machine code)
- 15: declare a vector holding two tasks
- 16: C++'s generate; equivalent to running DPUTask *taskconv = dpuCreateTask(kernelconv, 0) from the resnet50 example twice (the "SSD" comments look like leftovers from the SSD sample this code was adapted from)
- 19: the video file name
- 21~24: open the video; exit if it cannot be opened
- 28~31: declare four threads: the first runs the Read function, the second and third both run runSegmentation (i.e. the DPU tasks), and the fourth runs the Display function; google std::thread usage if needed
- 33~35: join the four threads above; the program blocks here until each thread finishes and is reaped
- 39: destroy the DPU kernel (think of it as a release)
- 41: close the DPU
- 43: release the video
---
#### dnndk_example/yolov3

- test_dnndk_yolov3.cpp
```c=1
#include <assert.h>
#include <algorithm>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>
#include <atomic>
#include <sys/stat.h>
#include <unistd.h>
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>
#include <fstream>
#include <iomanip>
#include <iostream>
#include <queue>
#include <mutex>
#include <string>
#include <vector>
#include <thread>
#include <zconf.h>
#include <dnndk/dnndk.h>
#include <opencv2/opencv.hpp>
#include "utils.h"
using namespace cv;
using namespace std;
using namespace std::chrono;
#define NMS_THRESHOLD 0.3f
#define KERNEL_CONV "yolov3_adas_512x256"
#define INPUT_NODE "layer0_conv"
const string outputs_node[4] = {"layer81_conv", "layer93_conv", "layer105_conv", "layer117_conv"};
const string classes[3] = {"car", "person", "cycle"};
int idxInputImage = 0; // frame index of input video
int idxShowImage = 0; // next frame index to be displayed
bool bReading = true; // flag of reding input frame
chrono::system_clock::time_point start_time;
typedef pair<int, Mat> imagePair;
class paircomp {
public:
bool operator()(const imagePair &n1, const imagePair &n2) const {
if (n1.first == n2.first) {
return (n1.first > n2.first);
}
return n1.first > n2.first;
}
};
// mutex for protection of input frames queue
mutex mtxQueueInput;
// mutex for protection of display frmaes queue
mutex mtxQueueShow;
// input frames queue
queue<pair<int, Mat>> queueInput;
// display frames queue
priority_queue<imagePair, vector<imagePair>, paircomp> queueShow;
```
- 32: this yolov3 network's input is 512x256
- 34: there are four output nodes, so an array is used instead of a #define, unlike the previous two examples
- 35: three recognizable classes
- the rest is largely the same as the previous two examples and is not repeated (note paircomp's equality branch is redundant: both paths return n1.first > n2.first)
```c=1
/**
* @brief Feed input frame into DPU for process
*
* @param task - pointer to DPU Task for YOLO-v3 network
* @param frame - pointer to input frame
* @param mean - mean value for YOLO-v3 network
*
* @return none
*/
void setInputImageForYOLO(DPUTask* task, const Mat& frame, float* mean) {
Mat img_copy;
int height = dpuGetInputTensorHeight(task, INPUT_NODE);
int width = dpuGetInputTensorWidth(task, INPUT_NODE);
int size = dpuGetInputTensorSize(task, INPUT_NODE);
int8_t* data = dpuGetInputTensorAddress(task, INPUT_NODE);
image img_new = load_image_cv(frame);
image img_yolo = letterbox_image(img_new, width, height);
vector<float> bb(size);
for(int b = 0; b < height; ++b) {
for(int c = 0; c < width; ++c) {
for(int a = 0; a < 3; ++a) {
bb[b*width*3 + c*3 + a] = img_yolo.data[a*height*width + b*width + c];
}
}
}
float scale = pow(2, 7);
for(int i = 0; i < size; ++i) {
data[i] = int(bb.data()[i]*scale);
if(data[i] < 0) data[i] = 127;
}
free_image(img_new);
free_image(img_yolo);
}
```
- 11: this Mat (img_copy) appears to be unused???
- 12~15: use the DPU APIs to get the input's dimensions and address
- 17: take the original video frame's data (Mat format), divide it by 256 and rearrange its layout (see the load_image_cv explanation below), returning img_new
- 18: resize img_new without changing its aspect ratio and paste it into an image struct of size width x height, returned as img_yolo; that is, img_yolo's dimensions are the ones obtained from dpuGetInputTensorXXXXX (see the letterbox_image explanation below)
- 20~27: declare a float vector bb and copy img_yolo's data into it with the layout changed (BBB...GGG...RRR... -> BGRBGRBGR...)
- 29: scale = 128
- 31~33: multiply bb's data by 128, cast to int, and write it to the DPU input address (int8_t* data); any negative value is set to 127
- 32: load_image_cv divided by 256 earlier, yet here we multiply by 128
- 33: data is int8_t*, i.e. signed char*, so a value whose top bit is 1 would be negative; negatives are forced to 127 (this should never actually happen?)
- the two lines above: the incoming RGB uses one unsigned char per channel (0~255), but what finally reaches the DPU is int8_t (0~127)? (see the arithmetic sketch below)
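- To make the scale arithmetic concrete: dividing by 256 in load_image_cv and multiplying by 2^7 here maps a pixel p in 0~255 to int(p/256*128) = int(p/2), i.e. 0~127, which always fits in a signed int8. A small self-contained sketch (quantizePixel is a hypothetical helper name):
```cpp
#include <cassert>
#include <cstdint>

// Combined input quantization from load_image_cv + setInputImageForYOLO:
// divide the 8-bit pixel by 256 (-> [0, 1)), then multiply by scale = 2^7
// and truncate, giving int(p / 2) in 0..127.
int8_t quantizePixel(uint8_t p) {
    float normalized = p / 256.0f;  // [0, 1)
    float scale = 128.0f;           // pow(2, 7) in the sample
    return static_cast<int8_t>(normalized * scale);
}

int main() {
    assert(quantizePixel(0) == 0);
    assert(quantizePixel(255) == 127);  // never negative, so the
    return 0;                           // `data[i] < 0` guard should not trigger
}
```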
```c=1
/**
* @brief Thread entry for reading image frame from the input video file
*
* @param fileName - pointer to video file name
*
* @return none
*/
void readFrame(const char *fileName) {
static int loop = 3;
VideoCapture video;
string videoFile = fileName;
start_time = chrono::system_clock::now();
while (loop>0) {
loop--;
if (!video.open(videoFile)) {
cout<<"Fail to open specified video file:" << videoFile << endl;
exit(-1);
}
while (true) {
usleep(20000);
Mat img;
if (queueInput.size() < 30) {
if (!video.read(img) ) {
break;
}
mtxQueueInput.lock();
queueInput.push(make_pair(idxInputImage++, img));
mtxQueueInput.unlock();
} else {
usleep(10);
}
}
video.release();
}
exit(0);
}
```
- much the same as the segmentation example
- the extra loop = 3 variable makes the video be read three times, so the program only ends after the same video has been played three times
```c=1
/**
* @brief Thread entry for displaying image frames
*
* @param none
* @return none
*
*/
void displayFrame() {
Mat frame;
while (true) {
mtxQueueShow.lock();
if (queueShow.empty()) {
mtxQueueShow.unlock();
usleep(10);
} else if (idxShowImage == queueShow.top().first) {
auto show_time = chrono::system_clock::now();
stringstream buffer;
frame = queueShow.top().second;
auto dura = (duration_cast<microseconds>(show_time - start_time)).count();
buffer << fixed << setprecision(1)
<< (float)queueShow.top().first / (dura / 1000000.f);
string a = buffer.str() + " FPS";
cv::putText(frame, a, cv::Point(10, 15), 1, 1, cv::Scalar{240, 240, 240},1);
cv::imshow("YoloV3 Detection", frame);
idxShowImage++;
queueShow.pop();
mtxQueueShow.unlock();
if (waitKey(1) == 'q') {
bReading = false;
exit(0);
}
} else {
mtxQueueShow.unlock();
}
}
}
```
- essentially the same as the segmentation example, plus a few lines that compute the FPS and draw it on the OSD
```c=1
/**
* @brief Post process after the runing of DPU for YOLO-v3 network
*
* @param task - pointer to DPU task for running YOLO-v3
* @param frame
* @param sWidth
* @param sHeight
*
* @return none
*/
void postProcess(DPUTask* task, Mat& frame, int sWidth, int sHeight){
vector<vector<float>> boxes;
for(int i = 0; i < 4; i++){
string output_node = outputs_node[i];
int channel = dpuGetOutputTensorChannel(task, output_node.c_str());
int width = dpuGetOutputTensorWidth(task, output_node.c_str());
int height = dpuGetOutputTensorHeight(task, output_node.c_str());
int sizeOut = dpuGetOutputTensorSize(task, output_node.c_str());
int8_t* dpuOut = dpuGetOutputTensorAddress(task, output_node.c_str());
float scale = dpuGetOutputTensorScale(task, output_node.c_str());
vector<float> result(sizeOut);
boxes.reserve(sizeOut);
/* Store every output node results */
get_output(dpuOut, sizeOut, scale, channel, height, width, result);
/* Store the object detection frames as coordinate information */
detect(boxes, result, channel, height, width, i, sHeight, sWidth);
}
/* Restore the correct coordinate frame of the original image */
correct_region_boxes(boxes, boxes.size(), frame.cols, frame.rows, sWidth, sHeight);
/* Apply the computation for NMS */
vector<vector<float>> res = applyNMS(boxes, classificationCnt, NMS_THRESHOLD);
float h = frame.rows;
float w = frame.cols;
for(size_t i = 0; i < res.size(); ++i) {
float xmin = (res[i][0] - res[i][2]/2.0) * w + 1.0;
float ymin = (res[i][1] - res[i][3]/2.0) * h + 1.0;
float xmax = (res[i][0] + res[i][2]/2.0) * w + 1.0;
float ymax = (res[i][1] + res[i][3]/2.0) * h + 1.0;
if(res[i][res[i][4] + 6] > CONF ) {
int type = res[i][4];
string classname = classes[type];
Point origin;
origin.x = xmin;
origin.y = ymin;
if (type==0) {
rectangle(frame, cvPoint(xmin, ymin), cvPoint(xmax, ymax), Scalar(0, 0, 255), 1, 1, 0);
putText(frame, classname, origin, FONT_HERSHEY_PLAIN, 1, Scalar(0, 0, 255), 1, 4);
}
else if (type==1) {
rectangle(frame, cvPoint(xmin, ymin), cvPoint(xmax, ymax), Scalar(255, 0, 0), 1, 1, 0);
putText(frame, classname, origin, FONT_HERSHEY_PLAIN, 1, Scalar(255, 0, 0), 1, 4);
}
else {
rectangle(frame, cvPoint(xmin, ymin), cvPoint(xmax, ymax), Scalar(0 ,255, 255), 1, 1, 0);
putText(frame, classname, origin, FONT_HERSHEY_PLAIN, 1, Scalar(0, 255, 255), 1, 4);
}
}
}
}
```
- post-processing
- 14~31: this yolov3 network has four outputs (vanilla yolov3 seems to have only three?); each of the four is processed in turn
- [reference](https://medium.com/@chih.sheng.huang821/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%ACyolov1-yolov2%E5%92%8Cyolov3-cfg-%E6%AA%94%E8%A7%A3%E8%AE%80-75793cd61a01)
- 27: process the DPU output (dpuOut) and store it into result (see the get_output explanation)
- 30: process result and store it into boxes (see the detect explanation)
- 34: map the coordinates and sizes in boxes back onto the original image (see the correct_region_boxes explanation)
- 37: run NMS over boxes and return the result into res
- 41~66: draw the boxes and show the class text
- 42~45: left/right/top/bottom of the box
- 47~64: res[i][4] holds the class index (0, 1 or 2) and res[i][class + 6] holds that class's score; when the score exceeds CONF, draw the rectangle and putText the class name (the full layout of each box vector is sketched below)
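- For readability, here is the layout of each boxes[i] / res[i] vector that detect() produces and postProcess() consumes, written out as a documentation sketch (not code from the sample):
```cpp
// Layout of one entry in `boxes` / `res` (a vector<float> of size 6 + classes):
// [0] x - box center x, relative to the network input width
// [1] y - box center y, relative to the network input height
// [2] w - box width,  relative to the network input width
// [3] h - box height, relative to the network input height
// [4] class index (set to -1 by detect(), overwritten to k inside applyNMS)
// [5] objectness score (sigmoid of the raw confidence output)
// [6 + k] score of class k = obj_score * sigmoid(raw class output)
//
// Hence res[i][4] is the class id and res[i][res[i][4] + 6] is that class's
// score, which is the value compared against CONF before drawing.
```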
```c=1
/**
* @brief Thread entry for running YOLO-v3 network on DPU for acceleration
*
* @param task - pointer to DPU task for running YOLO-v3
*
* @return none
*/
void runYOLO(DPUTask* task) {
/* mean values for YOLO-v3 */
float mean[3] = {0.0f, 0.0f, 0.0f};
int height = dpuGetInputTensorHeight(task, INPUT_NODE);
int width = dpuGetInputTensorWidth(task, INPUT_NODE);
while (true) {
pair<int, Mat> pairIndexImage;
mtxQueueInput.lock();
if (queueInput.empty()) {
mtxQueueInput.unlock();
if (bReading)
{
continue;
} else {
break;
}
} else {
/* get an input frame from input frames queue */
pairIndexImage = queueInput.front();
queueInput.pop();
mtxQueueInput.unlock();
}
vector<vector<float>> res;
/* feed input frame into DPU Task with mean value */
setInputImageForYOLO(task, pairIndexImage.second, mean);
/* invoke the running of DPU for YOLO-v3 */
dpuRunTask(task);
postProcess(task, pairIndexImage.second, width, height);
mtxQueueShow.lock();
/* push the image into display frame queue */
queueShow.push(pairIndexImage);
mtxQueueShow.unlock();
}
}
```
- much the same as segmentation
- 35: setInputImageForYOLO takes the original Mat, runs it through all the preprocessing, and writes it into the DPU input address (filling the values one by one in a for loop)
- 40: run the post-processing
```c=1
int main(const int argc, const char** argv) {
if (argc != 2) {
cout << "Usage of YoloV3 detection: ./test_dnndk_yolov3 video-file" << endl;
return -1;
}
setenv("DPU_COMPILATIONMODE","1",1);
/* The main procress of using DPU kernel begin. */
DPUKernel *kernel;
dpuOpen();
// Create the kernel
kernel = dpuLoadKernel(KERNEL_CONV);
vector<DPUTask *> task(5);
/* Create 4 DPU Tasks for YOLO-v3 network model */
generate(task.begin(), task.end(), std::bind(dpuCreateTask, kernel, 0));
/* Spawn 6 threads:
- 1 thread for reading video frame
- 4 identical threads for running YOLO-v3 network model
- 1 thread for displaying frame in monitor
*/
array<thread, 7> threadsList = {
thread(readFrame, argv[1]),
thread(displayFrame),
thread(runYOLO, task[0]),
thread(runYOLO, task[1]),
thread(runYOLO, task[2]),
thread(runYOLO, task[3]),
thread(runYOLO, task[4]),
};
for (int i = 0; i < 7; i++) {
threadsList[i].join();
}
/* Destroy DPU Tasks & free resources */
for_each(task.begin(), task.end(), dpuDestroyTask);
// Destroy the kernel after classification
dpuDestroyKernel(kernel);
dpuClose();
/* The main procress of using DPU kernel end. */
return 0;
}
```
- same approach as segmentation; the threads become 1 reader, 1 display, and 5 running the DPU task (the runYOLO function); note the code comments still say 4 tasks / 6 threads, but task(5) and array<thread, 7> actually create 5 YOLO threads and 7 threads in total
- utils.h
```c=1
#include <algorithm>
#include <iomanip>
#include <iosfwd>
#include <memory>
#include <string>
#include <utility>
#include <vector>
#include <math.h>
using namespace std;
using namespace std::chrono;
#define CONF 0.5
const int classificationCnt = 3;
const int anchorCnt = 5;
typedef struct {
int w;
int h;
int c;
float *data;
} image;
image load_image_cv(const cv::Mat& img);
image letterbox_image(image im, int w, int h);
void free_image(image m);
```
- 15: three classes can be distinguished
- 16: anchorCnt is 5?
- 18~23: a self-defined image struct storing w, h, c (channel; 3 for RGB) and the data pointer
```c=1
inline float sigmoid(float p) {
return 1.0 / (1 + exp(-p * 1.0));
}
```
- computes a sigmoid
```c=1
inline float overlap(float x1, float w1, float x2, float w2) {
float left = max(x1 - w1 / 2.0, x2 - w2 / 2.0);
float right = min(x1 + w1 / 2.0, x2 + w2 / 2.0);
return right - left;
}
```
- used by cal_iou
- 2: x1 is the center x coordinate and w1 the width, so x1 - w1 / 2.0 is the left edge (likewise for x2 and w2); this computes both left edges and takes the larger one
- 3: compute both right edges and take the smaller one
- 4: return right minus left
```c=1
inline float cal_iou(vector<float> box, vector<float>truth) {
float w = overlap(box[0], box[2], truth[0], truth[2]);
float h = overlap(box[1], box[3], truth[1], truth[3]);
if(w < 0 || h < 0) return 0;
float inter_area = w * h;
float union_area = box[2] * box[3] + truth[2] * truth[3] - inter_area;
return inter_area * 1.0 / union_area;
}
```
- used by applyNMS
- 2: compute the width of the overlapping region
- 3: compute the height of the overlapping region
- 4: if either the overlap width or height is negative, return 0 -> no overlap
- 6: inter_area = the area of the overlapping region
- 7: union_area = box area + truth area - overlap area
- 8: return the ratio of inter_area to union_area (i.e. how much of the two boxes' combined area the overlap covers); a numeric check follows below
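- A quick self-contained sanity check of overlap()/cal_iou(), copied from above and exercised on a hypothetical pair of boxes:
```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>
using std::max; using std::min; using std::vector;

// Copies of the utils functions, for a standalone worked example.
float overlap(float x1, float w1, float x2, float w2) {
    float left  = max(x1 - w1 / 2.0f, x2 - w2 / 2.0f);
    float right = min(x1 + w1 / 2.0f, x2 + w2 / 2.0f);
    return right - left;
}
float cal_iou(vector<float> box, vector<float> truth) {
    float w = overlap(box[0], box[2], truth[0], truth[2]);
    float h = overlap(box[1], box[3], truth[1], truth[3]);
    if (w < 0 || h < 0) return 0;
    float inter = w * h;
    float uni = box[2] * box[3] + truth[2] * truth[3] - inter;
    return inter / uni;
}
int main() {
    // Two 1x1 boxes whose centers are 0.5 apart horizontally:
    // intersection = 0.5, union = 1 + 1 - 0.5 = 1.5, IoU = 1/3.
    vector<float> a{0.0f, 0.0f, 1.0f, 1.0f};  // {cx, cy, w, h}
    vector<float> b{0.5f, 0.0f, 1.0f, 1.0f};
    assert(std::fabs(cal_iou(a, b) - 1.0f / 3.0f) < 1e-5f);
    return 0;
}
```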
```c=1
void correct_region_boxes(vector<vector<float>>& boxes, int n,
int w, int h, int netw, int neth, int relative = 0) {
int new_w=0;
int new_h=0;
if (((float)netw/w) < ((float)neth/h)) {
new_w = netw;
new_h = (h * netw)/w;
} else {
new_h = neth;
new_w = (w * neth)/h;
}
for (int i = 0; i < n; ++i){
boxes[i][0] = (boxes[i][0] - (netw - new_w)/2./netw) / ((float)new_w/(float)netw);
boxes[i][1] = (boxes[i][1] - (neth - new_h)/2./neth) / ((float)new_h/(float)neth);
boxes[i][2] *= (float)netw/new_w;
boxes[i][3] *= (float)neth/new_h;
}
}
```
- the original image was scaled before being handed to yolov3, and this undoes that scaling (letterbox_image below does something similar in the forward direction; see its explanation)
- 14~15: restore the x and y coordinates -> shift first, then multiply by a ratio
- 16~17: restore the width and height -> multiply by a ratio
```c=1
void detect(vector<vector<float>> &boxes, vector<float> result,
int channel, int height, int weight, int num, int sh, int sw);
void detect(vector<vector<float>> &boxes, vector<float> result,
int channel, int height, int width, int num, int sHeight, int sWidth) {
vector<float> biases{ 123,100, 167,83, 98,174, 165,158, 347,98, 76,37,
40,97, 74,64, 105,63, 66,131,18,46, 33,29, 47,23,
28,68, 52,42, 5.5,7, 8,17, 14,11, 13,29, 24,17 };
int conf_box = 5 + classificationCnt;
float swap[height * width][anchorCnt][conf_box];
for (int h = 0; h < height; ++h) {
for (int w = 0; w < width; ++w) {
for (int c = 0; c < channel; ++c) {
int temp = c * height * width + h * width + w;
swap[h * width + w][c / conf_box][c % conf_box] = result[temp];
}
}
}
for (int h = 0; h < height; ++h) {
for (int w = 0; w < width; ++w) {
for (int c = 0; c < anchorCnt; ++c) {
float obj_score = sigmoid(swap[h * width + w][c][4]);
if (obj_score < CONF)
continue;
vector<float> box;
box.push_back((w + sigmoid(swap[h * width + w][c][0])) / width);
box.push_back((h + sigmoid(swap[h * width + w][c][1])) / height);
box.push_back(exp(swap[h * width + w][c][2]) * biases[2 * c + 10 * num] / float(sWidth));
box.push_back(exp(swap[h * width + w][c][3]) * biases[2 * c + 10 * num + 1] / float(sHeight));
box.push_back(-1);
box.push_back(obj_score);
for (int p = 0; p < classificationCnt; p++) {
box.push_back(obj_score * sigmoid(swap[h * width + w][c][5 + p]));
}
boxes.push_back(box);
}
}
}
}
```
- 9: conf_box: the 5 stands for the DPU's raw outputs tx, ty, tw and th plus the confidence score (a.k.a. object confidence or object score); classificationCnt is the number of classes, 3 for this model, so three slots are reserved for the network's per-class outputs
- 10~19: copy the contents of result into the 3-D array swap[height * width][anchorCnt][conf_box], where anchorCnt is 5
- 20~21: process the contents of swap and store the results into boxes
- 23: obj_score is the confidence score
- 24~25: skip entries whose obj_score is too small
- 26: declare a temporary box
- 28: recover from tx the predicted x coordinate in model-input space (512x256) and push it into box
- 29: recover from ty the predicted y coordinate in model-input space (512x256) and push it into box
- 30: recover from tw the predicted width in model-input space (512x256) and push it into box
- 31: recover from th the predicted height in model-input space (512x256) and push it into box
- 32: push -1 into box (used later by the applyNMS function)
- 33: push obj_score into box
- 34~36: compute each class's score (the for loop runs three times for this 3-class model) as obj_score x sigmoid(raw class output) and push them
- 37: push box into boxes (boxes is this function's final output)
- [yolov3 reference1](https://zhuanlan.zhihu.com/p/50595699)
- [yolov3 reference2](https://mropengate.blogspot.com/2018/06/yolo-yolov3.html)
- [yolov3 reference3](https://medium.com/@chih.sheng.huang821/%E6%B7%B1%E5%BA%A6%E5%AD%B8%E7%BF%92-%E7%89%A9%E4%BB%B6%E5%81%B5%E6%B8%ACyolov1-yolov2%E5%92%8Cyolov3-cfg-%E6%AA%94%E8%A7%A3%E8%AE%80-75793cd61a01)
- [yolov3 reference4](https://github.com/pjreddie/darknet/issues/568)
```c=1
vector<vector<float>> applyNMS(vector<vector<float>>& boxes,int classes, const float thres) {
vector<pair<int, float>> order(boxes.size());
vector<vector<float>> result;
for(int k = 0; k < classes; k++) {
for (size_t i = 0; i < boxes.size(); ++i) {
order[i].first = i;
boxes[i][4] = k;
order[i].second = boxes[i][6 + k];
}
sort(order.begin(), order.end(),
[](const pair<int, float> &ls, const pair<int, float> &rs) { return ls.second > rs.second; });
vector<bool> exist_box(boxes.size(), true);
for (size_t _i = 0; _i < boxes.size(); ++_i) {
size_t i = order[_i].first;
if (!exist_box[i]) continue;
if (boxes[i][6 + k] < CONF) {
exist_box[i] = false;
continue;
}
/* add a box as result */
result.push_back(boxes[i]);
for (size_t _j = _i + 1; _j < boxes.size(); ++_j) {
size_t j = order[_j].first;
if (!exist_box[j]) continue;
float ovr = cal_iou(boxes[j], boxes[i]);
if (ovr >= thres) exist_box[j] = false;
}
}
}
return result;
}
```
- performs NMS (non-maximum suppression)
- 3: declare a result
- 5~33: run once per class; this model has 3, so the for loop runs three times (k = 0, 1, 2)
- 6~10: order[i].first holds i as an index, boxes[i][4] is set to k (the slot detect() set to -1), and order[i].second holds the score of the corresponding class
- 11~12: sort order by order.second (the score) in descending order (the scan below starts from the highest score)
- 14: declare exist_box, used as flags for boxes
- 16~32: for each order[_i]
- 17: i = order[_i].first, i.e. the boxes index (boxes are examined from high score to low)
- 18: if exist_box[i] has already been set to false, skip with continue
- 19~22: if boxes[i]'s score for class k is below CONF, set exist_box[i] to false and continue
- 24: reaching this line without a continue means boxes[i] gets push_back'ed into result
- 26~31: compare boxes[i] against every later box; any box that overlaps it beyond a certain degree has its exist_box flag cleared
- 27: j = order[_j].first
- 28: if exist_box[j] has already been set to false, skip with continue
- 29: if line 28 did not continue, call cal_iou to see how much boxes[i] and boxes[j] overlap
- 30: if ovr (think of it as the degree of overlap) is >= thres, set exist_box[j] to false as well
- 35: return result
```c=1
void get_output(int8_t* dpuOut, int sizeOut, float scale, int oc, int oh, int ow, vector<float>& result) {
vector<int8_t> nums(sizeOut);
memcpy(nums.data(), dpuOut, sizeOut);
for(int a = 0; a < oc; ++a){
for(int b = 0; b < oh; ++b){
for(int c = 0; c < ow; ++c) {
int offset = b * oc * ow + c * oc + a;
result[a * oh * ow + b * ow + c] = nums[offset] * scale;
}
}
}
}
```
- multiply dpuOut's data by scale (needed because the model was quantized, i.e. float to fixed point) and store it into result
- dpuOut's data appears to be laid out as array[height][width][channel], converted here into array[channel][height][width] (an index check follows below)
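- A small self-contained index check of that HWC -> CHW reordering, using the same loop structure as get_output (the scale factor is omitted and all names are illustrative):
```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// The DPU writes HWC (channel fastest); post-processing wants CHW
// (width fastest). Verify the index math on a tiny 2x2x3 tensor.
int main() {
    const int oc = 3, oh = 2, ow = 2;
    std::vector<int8_t> hwc(oc * oh * ow);
    // Fill HWC so that hwc[(h*ow + w)*oc + c] encodes (h, w, c).
    for (int h = 0; h < oh; ++h)
        for (int w = 0; w < ow; ++w)
            for (int c = 0; c < oc; ++c)
                hwc[(h * ow + w) * oc + c] = static_cast<int8_t>(h * 100 + w * 10 + c);
    // Same loops and offsets as get_output().
    std::vector<int> chw(oc * oh * ow);
    for (int a = 0; a < oc; ++a)
        for (int b = 0; b < oh; ++b)
            for (int c = 0; c < ow; ++c)
                chw[a * oh * ow + b * ow + c] = hwc[b * oc * ow + c * oc + a];
    // Element (h=1, w=0, channel=2) must land at CHW index 2*4 + 1*2 + 0.
    assert(chw[2 * oh * ow + 1 * ow + 0] == 100 + 0 + 2);
    return 0;
}
```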
```c=1
static float get_pixel(image m, int x, int y, int c)
{
assert(x < m.w && y < m.h && c < m.c);
return m.data[c*m.h*m.w + y*m.w + x];
}
```
- fetches the value at one specific position in an image's data (i.e. reads one pixel)
- the assert is a guard
```c=1
static void set_pixel(image m, int x, int y, int c, float val)
{
if (x < 0 || y < 0 || c < 0 || x >= m.w || y >= m.h || c >= m.c) return;
assert(x < m.w && y < m.h && c < m.c);
m.data[c*m.h*m.w + y*m.w + x] = val;
}
```
- writes a value at one specific position in an image's data (i.e. sets one pixel)
- the assert is a guard
```c=1
static void add_pixel(image m, int x, int y, int c, float val)
{
assert(x < m.w && y < m.h && c < m.c);
m.data[c*m.h*m.w + y*m.w + x] += val;
}
```
- adds a value to one specific position in an image's data (e.g. if that position held 0.3, it becomes 0.3 + val)
- the assert is a guard
```c=1
image make_empty_image(int w, int h, int c)
{
image out;
out.data = 0;
out.h = h;
out.w = w;
out.c = c;
return out;
}
```
- creates an image without allocating its data buffer
```c=1
void free_image(image m)
{
if(m.data){
free(m.data);
}
}
```
- frees the image's data
```c=1
image make_image(int w, int h, int c)
{
image out = make_empty_image(w,h,c);
out.data = (float*) calloc(h*w*c, sizeof(float));
return out;
}
```
- creates a w*h*c image and allocates a zero-filled buffer for its data
```c=1
void fill_image(image m, float s)
{
int i;
for(i = 0; i < m.h*m.w*m.c; ++i) m.data[i] = s;
}
```
- fills the whole of image m with s
```c=1
void embed_image(image source, image dest, int dx, int dy)
{
int x,y,k;
for(k = 0; k < source.c; ++k){
for(y = 0; y < source.h; ++y){
for(x = 0; x < source.w; ++x){
float val = get_pixel(source, x,y,k);
set_pixel(dest, dx+x, dy+y, k, val);
}
}
}
}
```
- pastes source into dest
- think of it as aligning source's (0,0) with dest's (dx,dy) and pasting from there
```c=1
void ipl_into_image(IplImage* src, image im)
{
unsigned char *data = (unsigned char *)src->imageData;
int h = src->height;
int w = src->width;
int c = src->nChannels;
int step = src->widthStep;
int i, j, k;
for(i = 0; i < h; ++i){
for(k= 0; k < c; ++k){
for(j = 0; j < w; ++j){
im.data[k*w*h + i*w + j] = data[i*step + j*c + k]/256.;
}
}
}
}
```
- never called, but the same logic appears later in load_image_cv
```c=1
image ipl_to_image(IplImage* src)
{
int h = src->height;
int w = src->width;
int c = src->nChannels;
image out = make_image(w, h, c);
ipl_into_image(src, out);
return out;
}
```
- not used; skip it
```c=1
void rgbgr_image(image im)
{
int i;
for(i = 0; i < im.w*im.h; ++i){
float swap = im.data[i];
im.data[i] = im.data[i+im.w*im.h*2];
im.data[i+im.w*im.h*2] = swap;
}
}
```
- never called, but the same logic appears later in load_image_cv
```c=1
image resize_image(image im, int w, int h)
{
image resized = make_image(w, h, im.c);
image part = make_image(w, im.h, im.c);
int r, c, k;
float w_scale = (float)(im.w - 1) / (w - 1);
float h_scale = (float)(im.h - 1) / (h - 1);
for(k = 0; k < im.c; ++k){
for(r = 0; r < im.h; ++r){
for(c = 0; c < w; ++c){
float val = 0;
if(c == w-1 || im.w == 1){
val = get_pixel(im, im.w-1, r, k);
} else {
float sx = c*w_scale;
int ix = (int) sx;
float dx = sx - ix;
val = (1 - dx) * get_pixel(im, ix, r, k) + dx * get_pixel(im, ix+1, r, k);
}
set_pixel(part, c, r, k, val);
}
}
}
for(k = 0; k < im.c; ++k){
for(r = 0; r < h; ++r){
float sy = r*h_scale;
int iy = (int) sy;
float dy = sy - iy;
for(c = 0; c < w; ++c){
float val = (1-dy) * get_pixel(part, c, iy, k);
set_pixel(resized, c, r, k, val);
}
if(r == h-1 || im.h == 1) continue;
for(c = 0; c < w; ++c){
float val = dy * get_pixel(part, c, iy+1, k);
add_pixel(resized, c, r, k, val);
}
}
}
free_image(part);
return resized;
}
```
- resizes with interpolation (skipping the fine details)
- 8~23: per channel, resize horizontally into part
- 24~39: per channel, resize vertically into resized
```c=1
image load_image_cv(const cv::Mat& img) {
int h = img.rows;
int w = img.cols;
int c = img.channels();
image im = make_image(w, h, c);
unsigned char *data = img.data;
for(int i = 0; i < h; ++i){
for(int k= 0; k < c; ++k){
for(int j = 0; j < w; ++j){
im.data[k*w*h + i*w + j] = data[i*w*c + j*c + k]/256.;
}
}
}
for(int i = 0; i < im.w*im.h; ++i){
float swap = im.data[i];
im.data[i] = im.data[i+im.w*im.h*2];
im.data[i+im.w*im.h*2] = swap;
}
return im;
}
```
- divides the Mat's pixels by the scale (256.) and stores them into image im; in the Mat the data order is RGBRGBRGB..., but in im a whole plane of B is stored first, then a whole plane of G, then a whole plane of R (i.e. array[H][W][channel] -> array[channel][H][W])
- 9~15: RGBRGBRGB... -> RRR...GGG...BBB..., divided by the scale (256.0)
- 17~21: swap the R and B planes, i.e. RRR...GGG...BBB... -> BBB...GGG...RRR...
```c=1
image letterbox_image(image im, int w, int h)
{
int new_w = im.w;
int new_h = im.h;
if (((float)w/im.w) < ((float)h/im.h)) {
new_w = w;
new_h = (im.h * w)/im.w;
} else {
new_h = h;
new_w = (im.w * h)/im.h;
}
image resized = resize_image(im, new_w, new_h);
image boxed = make_image(w, h, im.c);
fill_image(boxed, .5);
embed_image(resized, boxed, (w-new_w)/2, (h-new_h)/2);
free_image(resized);
return boxed;
}
```
- 1: the parameters "int w" and "int h" are the yolov3 network's input size, i.e. 512 and 256
- 3~4: take the original frame's size into new_w and new_h
- 5~11: new_w and new_h become a proportional scaling of im's dimensions that fits inside w x h (512x256)
    - if the original is 512x512 then new_w=256, new_h=256; if the original is 512x1024 then new_w=128, new_h=256
- 12: resize the original to new_w x new_h
- 13~14: declare an image boxed (size w x h, i.e. 512x256) and fill it entirely with 0.5
- 16: paste the resized image into the middle of boxed (the unfilled margin stays at 0.5)
---
#### Using the framebuffer driver (/dev/fbX, where X is a digit)
- dnndk_example/segmentation rewritten to output through the framebuffer instead of imshow; Chimei's eventual Linux system will probably not have GTK (X Window), so imshow would be a problem there
- unchanged parts are not listed; see the dnndk_example/segmentation walkthrough
- no cleanup or optimization was done; this only demonstrates the functionality
```c=1
///////kaiden//////start
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <errno.h>
#include <linux/fb.h>
#include <sys/ioctl.h>
#include <string.h>
#include <sys/mman.h>
#define FB0 "/dev/fb0"
int fd = 0;
char *fd_map = NULL;
// fixed screen parameters
struct fb_fix_screeninfo fix_info;
// variable screen parameters
struct fb_var_screeninfo var_info;
int ret = -1;
int screen_size = 0;
///////kaiden//////end
```
- adds the relevant includes and variable definitions
```c=1
int frame_buffer_initial()
{
/////////////////////////////////kaiden//////////////////////////////////////start
memset(&fix_info, 0, sizeof(struct fb_fix_screeninfo));
memset(&var_info, 0, sizeof(struct fb_var_screeninfo));
fd = open(FB0, O_RDWR);
if(fd < 0)
{
char *error_msg = strerror(errno);
printf("Open %s failed, errno:%d, error message:%s\n",
FB0, errno, error_msg);
return -1;
}
// get varied info
ret = ioctl(fd, FBIOGET_VSCREENINFO, &var_info);
if(ret < 0)
{
char *error_msg = strerror(errno);
printf("Get %s var info error, errno:%d, error message:%s\n",
FB0, errno, error_msg);
return ret;
}
// get fix info
ret = ioctl(fd, FBIOGET_FSCREENINFO, &fix_info);
if(ret < 0)
{
char *error_msg = strerror(errno);
printf("Get %s fix info error, errno:%d, error message:%s\n",
FB0, errno, error_msg);
return ret;
}
printf("%s var info, xres=%d, yres=%d\n", FB0, var_info.xres, var_info.yres);
printf("%s var info, xres_virtual=%d, yres_virtual=%d\n",
FB0, var_info.xres_virtual, var_info.yres_virtual);
printf("%s var info, bits_per_pixel=%d\n", FB0, var_info.bits_per_pixel);
printf("%s var info, xoffset=%d, yoffset=%d\n",
FB0, var_info.xoffset, var_info.yoffset);
printf("r_len=%d,r_off=%d,g_len=%d,g_off=%d,b_len=%d,b_off=%d"
,var_info.red.length
,var_info.red.offset
,var_info.green.length
,var_info.green.offset
,var_info.blue.length
,var_info.blue.offset);
#if 0 // try to modify RGB565 to RGB888 but failed
var_info.bits_per_pixel = 24;
ret = ioctl(fd, FBIOPUT_VSCREENINFO, &var_info);
if (ret)
printf("Error\n");
else
printf("Bits per pixel set\n");
printf("%s var info, xres=%d, yres=%d\n", FB0, var_info.xres, var_info.yres);
printf("%s var info, xres_virtual=%d, yres_virtual=%d\n",
FB0, var_info.xres_virtual, var_info.yres_virtual);
printf("%s var info, bits_per_pixel=%d\n", FB0, var_info.bits_per_pixel);
printf("%s var info, xoffset=%d, yoffset=%d\n",
FB0, var_info.xoffset, var_info.yoffset);
printf("r_len=%d,r_off=%d,g_len=%d,g_off=%d,b_len=%d,b_off=%d"
,var_info.red.length
,var_info.red.offset
,var_info.green.length
,var_info.green.offset
,var_info.blue.length
,var_info.blue.offset);
#endif
screen_size = var_info.xres * var_info.yres * var_info.bits_per_pixel / 8;
// mmap frame buffer to user space
fd_map = (char *)mmap(NULL, screen_size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);
if(fd_map == (char *)-1)
{
char *error_msg = strerror(errno);
printf("Mmap %s failed, errno:%d, error message:%s\n",
FB0, errno, error_msg);
return -1;
}
//munmap(fd_map, screen_size);
//close(fd);
/////////////////////////////////kaiden//////////////////////////////////////end
return 0; // the function is declared int, so return success explicitly
}
```
- adds a function frame_buffer_initial (so named to match the call in main) that initializes the framebuffer
```c=1
int main(int argc, char **argv) {
// Check args
if (argc != 2) {
cout << "Usage of segmentation demo: ./segmentaion file_name[string]" << endl;
cout << "\tfile_name: path to your video file" << endl;
return -1;
}
frame_buffer_initial(); //kaiden
setenv("DPU_COMPILATIONMODE", "1", 1);
dpuOpen();
...
...
...
dpuClose();
video.release();
munmap(fd_map, screen_size); //kaiden
return 0;
}
```
- 9: main calls frame_buffer_initial
- 22: munmap releases the framebuffer mapping
```c=1
void Display(bool &is_displaying) {
while (is_displaying) {
mtx_display_queue.lock();
if (display_queue.empty()) {
if (is_running_1 || is_running_2) {
mtx_display_queue.unlock();
usleep(20);
} else {
is_displaying = false;
break;
}
} else if (display_index == display_queue.top().first) {
// Display image
//imshow("FPN Segmentaion", display_queue.top().second);
...
...
...
}
```
- 14: the imshow call in the Display function is commented out
```c=1
void runSegmentation(DPUTask *task, bool &is_running) {
// initialize the task's parameters
DPUTensor *conv_in_tensor = dpuGetInputTensor(task, CONV_INPUT_NODE);
int inHeight = dpuGetTensorHeight(conv_in_tensor);
int inWidth = dpuGetTensorWidth(conv_in_tensor);
DPUTensor *conv_out_tensor = dpuGetOutputTensor(task, CONV_OUTPUT_NODE);
int outHeight = dpuGetTensorHeight(conv_out_tensor);
int outWidth = dpuGetTensorWidth(conv_out_tensor);
int8_t *outTensorAddr = dpuGetTensorAddress(conv_out_tensor);
// Run detection for images in read queue
while (is_running) {
// Get an image from read queue
...
...
...
// Put image into display queue
mtx_display_queue.lock();
display_queue.push(make_pair(index, img));
mtx_display_queue.unlock();
unsigned char* ptr = img.data;
for(unsigned int i = 0; i < inHeight; i++)
{
for(int j = 0; j < inWidth; j++)
{
unsigned char b_tmp = (*(ptr+j*3+i*inWidth*3+0))>>3;
unsigned char g_tmp = (*(ptr+j*3+i*inWidth*3+1))>>2;
unsigned char r_tmp = (*(ptr+j*3+i*inWidth*3+2))>>3;
unsigned short bgr=(r_tmp<<11)+(g_tmp<<5)+b_tmp;
*((unsigned short *)(fd_map + j*2 + i * var_info.xres * 2 )) = bgr;
}
}
}
}
```
- the output image is copied to the framebuffer directly inside runSegmentation
- 27~40: the copy into the framebuffer; note the format is RGB565 (16 bits per pixel), and the packing is shown in the sketch below
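- The bit twiddling in that copy loop packs 24-bit BGR into RGB565; factored out as a helper it reads as follows (packRGB565 is a hypothetical name, a sketch of the same math):
```cpp
#include <cstdint>

// RGB565 packing as used in the framebuffer copy loop above:
// keep the top 5 bits of R and B and the top 6 bits of G,
// then lay them out as RRRRRGGG GGGBBBBB in one 16-bit word.
static inline uint16_t packRGB565(uint8_t b, uint8_t g, uint8_t r) {
    uint16_t r5 = r >> 3;  // 8 -> 5 bits
    uint16_t g6 = g >> 2;  // 8 -> 6 bits
    uint16_t b5 = b >> 3;  // 8 -> 5 bits
    return static_cast<uint16_t>((r5 << 11) | (g6 << 5) | b5);
}
```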