Team 2
瞿旭民 <kevinbird61>
林哲亘 <CheHsuan>
謝永勁 <SwimGlass>
陳柔安 <carolc0708>
github source:
https://github.com/kevinbird61/Image-Processing
digraph {
intro[shape="box", style=rounded];
intro[label="Introduction to image-processing"];
intro->env;
env[shape="box", style=rounded];
env[label="Intel & ARM NEON implementation"];
env->bmp;
bmp[label="Why choose BMP ? + About DSs"];
bmp->program;
program[shape="box", style=rounded];
program[label="Program architecture and execution"];
program->optimize;
optimize[shape="box", style=rounded];
optimize[label="Optimization methods"];
optimize->result;
result[shape="box", style=rounded];
result[label="Execution results"];
optimize->check;
check[shape="box", style=rounded];
check[label="Verification method"];
optimize->future;
future[shape="box", style=rounded];
future[label="future work"];
}
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model name: Intel(R) Core(TM) i7-4700HQ CPU @ 2.40GHz
CPU MHz: 3199.968
CPU max MHz: 3400.0000
CPU min MHz: 800.0000
BogoMIPS: 4789.25
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 6144K
NUMA node0 CPU(s): 0-7
Architecture: armv7l
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 1
Core(s) per socket: 4
Socket(s): 1
Model name: ARMv7 Processor rev 4 (v7l)
CPU max MHz: 1200.0000
CPU min MHz: 600.0000
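Gaussian blur, also known as Gaussian smoothing, is mainly used to reduce the detail and tonal gradation of an image.
As the figure above shows, one point is taken as the target of the computation, and the farther a neighbor is from that point, the less it contributes.
The weights follow a normal distribution (also called a Gaussian distribution): the closer a point is to the mean μ, the larger its weight. A smaller standard deviation σ means the weights are more concentrated; a larger σ means they are more spread out.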
Gaussian Kernel
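For reference, the weighting described above is the standard two-dimensional Gaussian function. Here $(\mu_x, \mu_y)$ stands for the target pixel and $\sigma$ for the standard deviation; the subscripted symbols are notation introduced only for this note. In practice the sampled weights are usually normalized so that they sum to 1.

```latex
G(x, y) = \frac{1}{2\pi\sigma^{2}}
          \exp\!\left(-\frac{(x-\mu_x)^{2} + (y-\mu_y)^{2}}{2\sigma^{2}}\right)
```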
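NEON is ARM's SIMD technology. Compared with ARMv6 and earlier architectures, NEON combines 64-bit and 128-bit SIMD instruction sets and provides 128-bit vector operations. NEON was introduced with ARMv7 and is currently available on ARM Cortex-A and Cortex-R series processors.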
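To make the 128-bit vector operations concrete, here is a minimal sketch that is not taken from the project. The function name `average_planes` is hypothetical. With NEON available, one `vhaddq_u8` instruction averages 16 bytes at a time; on other targets the scalar loop does all the work, so the code still compiles.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#define HAVE_NEON 1
#endif

/* Hypothetical example: per-byte average of two planes, (a + b) / 2
 * rounded down. With NEON, one vhaddq_u8 handles 16 bytes. */
void average_planes(const unsigned char *a, const unsigned char *b,
                    unsigned char *out, size_t n)
{
    size_t i = 0;
#ifdef HAVE_NEON
    for (; i + 16 <= n; i += 16) {
        uint8x16_t va = vld1q_u8(a + i);        /* load 16 bytes */
        uint8x16_t vb = vld1q_u8(b + i);
        vst1q_u8(out + i, vhaddq_u8(va, vb));   /* halving add, truncating */
    }
#endif
    /* Scalar tail (or the whole array when NEON is unavailable). */
    for (; i < n; i++)
        out[i] = (unsigned char)((a[i] + b[i]) >> 1);
}
```

On a 32-bit ARMv7 target such as the one listed above, GCC typically needs `-mfpu=neon` for the NEON path to be enabled.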
// Original structure
typedef struct tagRGBTRIPLE {   // (3 bytes)
    BYTE rgbBlue;               // (1 byte) blue channel
    BYTE rgbGreen;              // (1 byte) green channel
    BYTE rgbRed;                // (1 byte) red channel
} RGBTRIPLE;

// Split structure
color_r = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
color_g = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
color_b = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
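The relationship between the two layouts can be sketched as a conversion from the interleaved `RGBTRIPLE` array to three planar arrays. The helper name `split_channels` is hypothetical and the project may fill its `color_r`, `color_g`, and `color_b` buffers differently.

```c
#include <stddef.h>

typedef unsigned char BYTE;

/* Same layout as the RGBTRIPLE shown above: BMP stores pixels as B, G, R. */
typedef struct tagRGBTRIPLE {
    BYTE rgbBlue;
    BYTE rgbGreen;
    BYTE rgbRed;
} RGBTRIPLE;

/* Hypothetical helper: copy `count` interleaved pixels into three
 * caller-allocated planar arrays, one byte per pixel per channel. */
void split_channels(const RGBTRIPLE *src, size_t count,
                    BYTE *color_r, BYTE *color_g, BYTE *color_b)
{
    for (size_t i = 0; i < count; i++) {
        color_r[i] = src[i].rgbRed;
        color_g[i] = src[i].rgbGreen;
        color_b[i] = src[i].rgbBlue;
    }
}
```

With the planar layout, each channel is a contiguous run of bytes, which is the form that byte-wise SIMD loads consume most directly.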
Version | naive(ori) | naive(tri) |
Gaussian blur (ms) | 320.122994 | 441.310818 |
As shown, the original data structure runs the blur function only once, which is two fewer blur function calls than the split data structure needs; the two implementations differ by 0.073597 seconds.
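Both layouts were implemented mainly with the later SSE/AVX instruction set versions in mind: the data may need to be rearranged to fit the size of one register (128/256 bits), so that all of the register space is used and the number of loop iterations is reduced.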
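The register-width argument can be illustrated with a minimal sketch that is not the project's blur code. The function `brighten_plane` is hypothetical: it adds a constant to one planar channel, and with SSE2 each loop iteration processes 16 bytes, so a plane of n bytes needs about n/16 iterations instead of n.

```c
#include <stddef.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical example: add `delta` to every byte of one colour plane,
 * saturating at 255. With SSE2, each iteration handles 16 bytes. */
void brighten_plane(unsigned char *plane, size_t n, unsigned char delta)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i vdelta = _mm_set1_epi8((char)delta);
    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(plane + i));
        v = _mm_adds_epu8(v, vdelta);           /* unsigned saturating add */
        _mm_storeu_si128((__m128i *)(plane + i), v);
    }
#endif
    /* Scalar tail (or the whole array when SSE2 is unavailable). */
    for (; i < n; i++) {
        unsigned int s = (unsigned int)plane[i] + delta;
        plane[i] = (unsigned char)(s > 255 ? 255 : s);
    }
}
```

An interleaved B, G, R layout does not divide evenly into 16-byte registers (16 is not a multiple of 3), which is one reason a layout change may be needed before vectorizing.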
digraph {
start->Makefile;
start->shell;
start[shape="box", style=rounded];
start[label="Run the program (two ways provided)"];
shell[label="shell script"];
}
$ bash image_process.sh -g 1
Your getopt version is OK!
compile with target gaussian: 1
The program will be compile with:
Git commit hooks are installed successfully.
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/bmp.h
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/gaussian.c
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/gaussian.h
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/hsv.c
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/hsv.h
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/main.c
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/mirror_arm.c
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/mirror.c
Unchanged /home/kevin/Dropbox/workspace/Image-Processing/mirror.h
[Prepared to execute...]Enter the execution times your want to apply on image:[Press Enter to default=1] 5
[Prepared to execute...]Enter the thread numbers your want to apply on image:[Press Enter to default=2] 4
[Prepared to execute...]If you want to change input filename(Press Enter to get default)?[y/n]
[Prepared to execute...]If you want to change output filename(Press Enter to get default)?[y/n]
Picture size of picture is width: 1600 , height 1200
Read file successfully
Gaussian blur[5x5][sse pthread original structure], execution time : 484.307227 ms , with 5 times Gaussian blur
Save file successfully
...
bash image_process.sh [...]
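Change how the original naive version operates, or change its data structure (DS)
Add extra libraries as a means of acceleration
Change the algorithm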
Version | naive Split | naive Original | naive Unroll split | naive Unroll original |
1 run (ms) | 429.951608 | 330.506621 | 188.229402 | 251.948595 |
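For the split version, the time difference before and after unrolling is 231.528868 ms, which is about 77.18 ms after dividing by 3; for the original version, the difference before and after is 78.322224 ms, nearly the same. This indicates that, for the test subject (a 1600 x 1200 image file), applying unroll improves a single execution by roughly 77 to 78 ms.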
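As a reminder of what unrolling changes, here is a minimal sketch comparing a rolled and an unrolled inner loop for one output pixel. The 5-tap weights and both function names are hypothetical and are not claimed to match the project's kernel. The caller must keep `x` at least 2 samples away from either end of the row.

```c
#include <stddef.h>

/* Hypothetical 5-tap weights; not claimed to be the project's kernel. */
static const int TAPS[5] = {1, 4, 6, 4, 1};   /* sums to 16 */

/* Rolled version: inner loop over the five taps. */
unsigned char blur_pixel_rolled(const unsigned char *row, size_t x)
{
    int sum = 0;
    for (int k = 0; k < 5; k++)
        sum += TAPS[k] * row[x + k - 2];
    return (unsigned char)(sum / 16);
}

/* Unrolled version: same arithmetic, written out without the tap loop. */
unsigned char blur_pixel_unrolled(const unsigned char *row, size_t x)
{
    int sum = 1 * row[x - 2] + 4 * row[x - 1] + 6 * row[x]
            + 4 * row[x + 1] + 1 * row[x + 2];
    return (unsigned char)(sum / 16);
}
```

Both functions return the same value for every valid `x`; the unrolled form simply removes the loop counter, the bounds check on `k`, and the table lookups from the hot path.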
Version | naive Split | naive Expand Split |
1 run (ms) | 429.951608 | 390.830719 |
Version | 2D filter Unroll | 1D filter Unroll |
1 run (ms) | 174.879173 | 139.384629 |
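The 1D filter timing above presumably relies on the Gaussian kernel being separable; that reading, the taps, and the function names below are assumptions for illustration. The sketch shows that a 2D pass with kernel K[i][j] = T[i] * T[j] gives the same unnormalized sum as a horizontal 1D pass followed by a vertical 1D pass.

```c
#include <stddef.h>

#define R 2                                   /* kernel radius: 5 taps */
static const int T[5] = {1, 4, 6, 4, 1};      /* hypothetical 1D taps  */

/* 2D filter: 25 multiply-adds per pixel, using K[i][j] = T[i] * T[j]. */
int filter_2d(const int *img, int w, int x, int y)
{
    int sum = 0;
    for (int i = -R; i <= R; i++)
        for (int j = -R; j <= R; j++)
            sum += T[i + R] * T[j + R] * img[(y + i) * w + (x + j)];
    return sum;
}

/* 1D filter along a row: 5 multiply-adds per pixel. */
int filter_row(const int *img, int w, int x, int y)
{
    int sum = 0;
    for (int j = -R; j <= R; j++)
        sum += T[j + R] * img[y * w + (x + j)];
    return sum;
}

/* Two 1D passes combined: filter five rows horizontally, then weight
 * the five results vertically. For clarity this recomputes the row
 * results for each pixel; a real implementation would store the
 * horizontal pass in a temporary buffer and reuse it. */
int filter_separable(const int *img, int w, int x, int y)
{
    int sum = 0;
    for (int i = -R; i <= R; i++)
        sum += T[i + R] * filter_row(img, w, x, y + i);
    return sum;
}
```

With a temporary buffer for the horizontal pass, the cost per pixel drops from 25 multiply-adds to about 10, which is consistent with the direction of the measured difference. The caller must keep `(x, y)` at least `R` pixels away from the image border.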
Version | naive Split | naive Original | SSE Split | SSE Original |
1 run (ms) | 429.951608 | 330.506621 | 294.354542 | 275.450552 |
Version | unroll Split | Pthread Split |
1 run (ms) | 164.484354 | 103.175455 |
Things to watch out for:
Do not make your own assumptions about how the "computer" operates
Version | SSE Original | Pthread SSE Original |
1 run (ms) | 232.851845 | 69.630850 |
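The pthread variants in the tables above divide the work among threads. One common way to do that, shown here as a sketch under stated assumptions, is to give each thread a contiguous band of rows. The names `band_t`, `process_band`, and `process_parallel` are hypothetical, and the per-pixel work is a placeholder (inverting the byte) rather than the project's blur.

```c
#include <pthread.h>
#include <stddef.h>

#define MAX_THREADS 16

typedef struct {
    unsigned char *data;   /* one colour plane, width * height bytes */
    int width;
    int row_begin;         /* first row handled by this thread       */
    int row_end;           /* one past the last row                  */
} band_t;

/* Placeholder per-row work (inverts the pixels); a real version would
 * call the blur routine for each row of the band instead. */
static void *process_band(void *arg)
{
    band_t *b = (band_t *)arg;
    for (int y = b->row_begin; y < b->row_end; y++)
        for (int x = 0; x < b->width; x++)
            b->data[(size_t)y * b->width + x] ^= 0xFF;
    return NULL;
}

/* Split `height` rows into `nthreads` contiguous bands, one per thread.
 * Returns 0 on success, -1 on bad arguments or thread-creation failure. */
int process_parallel(unsigned char *data, int width, int height, int nthreads)
{
    pthread_t tid[MAX_THREADS];
    band_t band[MAX_THREADS];
    if (nthreads < 1 || nthreads > MAX_THREADS)
        return -1;
    int created = 0, status = 0;
    for (int t = 0; t < nthreads; t++) {
        band[t].data = data;
        band[t].width = width;
        band[t].row_begin = (int)((long)height * t / nthreads);
        band[t].row_end = (int)((long)height * (t + 1) / nthreads);
        if (pthread_create(&tid[t], NULL, process_band, &band[t]) != 0) {
            status = -1;
            break;
        }
        created++;
    }
    for (int t = 0; t < created; t++)
        pthread_join(tid[t], NULL);
    return status;
}
```

The placeholder can safely run in place because each pixel depends only on itself. A blur reads neighboring rows, so a threaded blur would read from a source buffer and write to a separate destination buffer to avoid races at band boundaries. Compile with `-pthread`.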
The results on a Raspberry Pi 3 are as follows:
Version | Naive | NEON |
1 run (ms) | 823.816812 | 181.652232 |
Flip | naive(ori) | naive(tri) | openmp(tri) | sse(tri) |
Horizontal flip (ms) | 8.165160 | 16.725438 | 4.859310 | 4.280531 |
Vertical flip (ms) | 6.096713 | 16.969292 | 6.688112 | 1.869409 |
Flip | naive_arm(ori) | naive_arm(tri) | neon(tri) |
Horizontal flip (ms) | 22.934580 | 37.172171 | 10.752292 |
Vertical flip (ms) | 19.200639 | 8.979229 | 7.470433 |
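For readers unfamiliar with the two operations timed above, a naive version on one planar channel might look like the following. Both helper names are hypothetical and the project's `mirror.c` may be organized differently.

```c
#include <stddef.h>

/* Hypothetical helper: mirror one colour plane left-to-right in place. */
void flip_horizontal(unsigned char *plane, int width, int height)
{
    for (int y = 0; y < height; y++) {
        unsigned char *row = plane + (size_t)y * width;
        for (int l = 0, r = width - 1; l < r; l++, r--) {
            unsigned char tmp = row[l];
            row[l] = row[r];
            row[r] = tmp;
        }
    }
}

/* Hypothetical helper: mirror top-to-bottom by swapping whole rows. */
void flip_vertical(unsigned char *plane, int width, int height)
{
    for (int top = 0, bot = height - 1; top < bot; top++, bot--) {
        unsigned char *a = plane + (size_t)top * width;
        unsigned char *b = plane + (size_t)bot * width;
        for (int x = 0; x < width; x++) {
            unsigned char tmp = a[x];
            a[x] = b[x];
            b[x] = tmp;
        }
    }
}
```

The vertical flip moves contiguous rows, whereas the horizontal flip must reverse the byte order inside each row, so the two operations tend to vectorize differently.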
Check whether the two generated image files are identical at the binary level
kevin :-$ diff output_4_0_0.bmp output_8_0_0.bmp
kevin :-$
...
kevin :-$ diff output_4_0_0.bmp output_128_0_0.bmp
Binary files output_4_0_0.bmp and output_128_0_0.bmp differ
...
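If nothing is printed, the two binary image files are identical
Common causes of a difference:
(Taking Expand as an example) After running Gaussian blur on a matrix that contains a single non-zero point, the resulting matrix can be inspected directly to check whether the values around that point are correct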
Input | | Output |
0,0,0,0,0 | | 0,3,6,3,0 |
0,0,0,0,0 | | 3,14,24,14,3 |
0,0,255,0,0 | => | 6,24,38,24,6 |
0,0,0,0,0 | | 3,14,24,14,3 |
0,0,0,0,0 | | 0,3,6,3,0 |
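The single-point check can be automated. The expected values above agree with truncating integer division of 255 times a widely used 5x5 integer Gaussian approximation whose weights sum to 273. Whether the project uses exactly this kernel is an assumption, as are the function names `blur_5x5` and `single_point_test_passes`.

```c
#include <string.h>

/* Assumed kernel: a common 5x5 integer Gaussian approximation whose
 * weights sum to 273. The source does not state the project's kernel. */
static const int K[5][5] = {
    {1,  4,  7,  4, 1},
    {4, 16, 26, 16, 4},
    {7, 26, 41, 26, 7},
    {4, 16, 26, 16, 4},
    {1,  4,  7,  4, 1},
};

/* Blur a 5x5 image, treating pixels outside the image as 0. */
void blur_5x5(unsigned char in[5][5], unsigned char out[5][5])
{
    for (int y = 0; y < 5; y++) {
        for (int x = 0; x < 5; x++) {
            int sum = 0;
            for (int i = -2; i <= 2; i++) {
                for (int j = -2; j <= 2; j++) {
                    int yy = y + i, xx = x + j;
                    if (yy >= 0 && yy < 5 && xx >= 0 && xx < 5)
                        sum += K[i + 2][j + 2] * in[yy][xx];
                }
            }
            out[y][x] = (unsigned char)(sum / 273);   /* truncating */
        }
    }
}

/* Returns 1 if a single 255 at the centre blurs to the expected matrix. */
int single_point_test_passes(void)
{
    static const unsigned char expected[5][5] = {
        {0,  3,  6,  3, 0},
        {3, 14, 24, 14, 3},
        {6, 24, 38, 24, 6},
        {3, 14, 24, 14, 3},
        {0,  3,  6,  3, 0},
    };
    unsigned char in[5][5] = {{0}};
    unsigned char out[5][5];
    in[2][2] = 255;
    blur_5x5(in, out);
    return memcmp(out, expected, sizeof expected) == 0;
}
```

Because the input has only one non-zero pixel, each output value is just the corresponding kernel weight times 255 divided by 273, so a wrong weight or an off-by-one index shows up immediately in the output matrix.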
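The image files produced by all of the Gaussian blur programs are identical
Choose the more efficient original version as the baseline, then optimize it with SSE, pthread, and similar methods
Implement the ARM SSE Original version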