contributed by <CheHsuan>, <kevinbird61>, <SwimGlass>, <carolc0708>
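Gaussian blur
Horizontal and vertical flip
HSV saturation and brightness adjustment
All of the timing charts below for the various function implementations were measured by running Gaussian blur once on the image, using 4 threads.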
// Original structure
typedef struct tagRGBTRIPLE { //(3bytes)
BYTE rgbBlue; //(1bytes) blue channel
BYTE rgbGreen; //(1bytes) green channel
BYTE rgbRed; //(1bytes) red channel
} RGBTRIPLE;
// split structure
color_r = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
color_g = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
color_b = (unsigned char*)malloc(bmpInfo.biWidth*bmpInfo.biHeight*sizeof(unsigned char));
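To make the difference between the two layouts concrete, here is a minimal sketch of converting between them. Only `RGBTRIPLE` and its fields mirror the listing above; `split_planes`, `split_channels`, `merge_channels`, and `free_planes` are hypothetical names chosen for illustration, not the project's own code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef unsigned char BYTE;

/* Interleaved layout, as in the listing above: B, G, R for each pixel. */
typedef struct {
    BYTE rgbBlue;
    BYTE rgbGreen;
    BYTE rgbRed;
} RGBTRIPLE;

/* Hypothetical holder for the split layout: one plane per channel. */
typedef struct {
    BYTE *r, *g, *b;
    size_t count; /* pixels per plane */
} split_planes;

/* Copy interleaved pixels into three planes.
 * Returns 0 on success, -1 if an allocation fails. */
int split_channels(const RGBTRIPLE *src, size_t count, split_planes *out)
{
    assert(src != NULL && out != NULL && count > 0);
    out->r = malloc(count);
    out->g = malloc(count);
    out->b = malloc(count);
    out->count = count;
    if (!out->r || !out->g || !out->b) {
        free(out->r);
        free(out->g);
        free(out->b);
        out->r = out->g = out->b = NULL;
        return -1;
    }
    for (size_t i = 0; i < count; i++) {
        out->r[i] = src[i].rgbRed;
        out->g[i] = src[i].rgbGreen;
        out->b[i] = src[i].rgbBlue;
    }
    return 0;
}

/* Inverse operation: write the three planes back as interleaved pixels. */
void merge_channels(const split_planes *in, RGBTRIPLE *dst)
{
    for (size_t i = 0; i < in->count; i++) {
        dst[i].rgbRed = in->r[i];
        dst[i].rgbGreen = in->g[i];
        dst[i].rgbBlue = in->b[i];
    }
}

void free_planes(split_planes *p)
{
    free(p->r);
    free(p->g);
    free(p->b);
    p->r = p->g = p->b = NULL;
    p->count = 0;
}
```

With the split layout, each channel can be blurred as an independent single-byte plane, at the cost of three passes over the image instead of one.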
$~: make gau_blur_tri
gcc main.c -DSPLIT=7 -DGAUSSIAN=1 -o bmpreader
$~: make run
# img/wf.bmp => has the 4 element(alpha value)
./bmpreader img/input.bmp output.bmp
Picture size of picture is width: 1600 , height 1200
Read file successfully
Gaussian blur[5x5][split structure], execution time : 0.395602 sec
Save file successfully
eog output.bmp
...
$~: make gau_blur_ori
gcc main.c -DGAUSSIAN=2 -o bmpreader
$~: make run
# img/wf.bmp => has the 4 element(alpha value)
./bmpreader img/input.bmp output.bmp
Picture size of picture is width: 1600 , height 1200
Read file successfully
Gaussian blur[5x5][original structure], execution time : 0.322005 sec
Save file successfully
eog output.bmp
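The two runs differ by 0.073597 seconds, with the split structure being the slower one (0.395602 sec vs 0.322005 sec). Next, consider how Gaussian blur is implemented (3x3 vs 5x5) together with the data structure:
We first load 5 rows (16 bytes per row), pick out the 5x5 elements, multiply them by the Gaussian blur coefficients, sum the products, and store the result back at the original position.
With the potential benefit of SSE in mind, we first try an unrolled version.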
...
// Using index++ many times in one expression is undefined behavior in C,
// so each term uses an explicit offset and index is advanced once afterwards.
sum = src[(j-2)*w+(i-2)]*gaussian55[index+0] + src[(j-2)*w+(i-1)]*gaussian55[index+1]
+ src[(j-2)*w+(i)]*gaussian55[index+2] + src[(j-2)*w+(i+1)]*gaussian55[index+3]
+ src[(j-2)*w+(i+2)]*gaussian55[index+4] + src[(j-1)*w+(i-2)]*gaussian55[index+5]
+ src[(j-1)*w+(i-1)]*gaussian55[index+6] + src[(j-1)*w+(i)]*gaussian55[index+7]
+ src[(j-1)*w+(i+1)]*gaussian55[index+8] + src[(j-1)*w+(i+2)]*gaussian55[index+9]
+ src[(j)*w+(i-2)]*gaussian55[index+10] + src[(j)*w+(i-1)]*gaussian55[index+11]
+ src[(j)*w+(i)]*gaussian55[index+12] + src[(j)*w+(i+1)]*gaussian55[index+13]
+ src[(j)*w+(i+2)]*gaussian55[index+14] + src[(j+1)*w+(i-2)]*gaussian55[index+15]
+ src[(j+1)*w+(i-1)]*gaussian55[index+16] + src[(j+1)*w+(i)]*gaussian55[index+17]
+ src[(j+1)*w+(i+1)]*gaussian55[index+18] + src[(j+1)*w+(i+2)]*gaussian55[index+19]
+ src[(j+2)*w+(i-2)]*gaussian55[index+20] + src[(j+2)*w+(i-1)]*gaussian55[index+21]
+ src[(j+2)*w+(i)]*gaussian55[index+22] + src[(j+2)*w+(i+1)]*gaussian55[index+23]
+ src[(j+2)*w+(i+2)]*gaussian55[index+24];
index += 25;
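For comparison, the unrolled sum above computes the same 25 terms as a plain double loop over the 5x5 window. The sketch below shows that loop form for one channel plane. The kernel values are an assumption (a commonly used integer 5x5 Gaussian kernel whose entries sum to 273), since the contents of `gaussian55` are not shown here; `kernel55`, `window_sum55`, and `gaussian_blur55` are hypothetical names, and copying border pixels unchanged is also an assumption of this sketch.

```c
#include <assert.h>

/* Assumed coefficients: a commonly used integer 5x5 Gaussian kernel. */
static const int kernel55[25] = {
    1,  4,  7,  4, 1,
    4, 16, 26, 16, 4,
    7, 26, 41, 26, 7,
    4, 16, 26, 16, 4,
    1,  4,  7,  4, 1,
};
enum { KERNEL55_SUM = 273 }; /* sum of all 25 entries */

/* Weighted sum of the 5x5 window centred on (i, j), row by row,
 * in the same order as the unrolled expression. */
int window_sum55(const unsigned char *src, int w, int i, int j)
{
    int sum = 0, index = 0;
    for (int dy = -2; dy <= 2; dy++)
        for (int dx = -2; dx <= 2; dx++)
            sum += src[(j + dy) * w + (i + dx)] * kernel55[index++];
    return sum;
}

/* Blur one channel plane of size w x h from src into dst.
 * Pixels within 2 of an edge are copied unchanged (sketch assumption). */
void gaussian_blur55(const unsigned char *src, unsigned char *dst, int w, int h)
{
    assert(src != dst); /* the blur is not done in place */
    for (int j = 0; j < h; j++) {
        for (int i = 0; i < w; i++) {
            if (i < 2 || j < 2 || i >= w - 2 || j >= h - 2)
                dst[j * w + i] = src[j * w + i];
            else
                dst[j * w + i] =
                    (unsigned char)(window_sum55(src, w, i, j) / KERNEL55_SUM);
        }
    }
}
```

The loop form leaves unrolling to the compiler, which makes it a useful baseline when measuring whether manual unrolling actually helps.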
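Analysis of the results: the modified makefile lets the user choose how many times Gaussian blur is applied to the image. The comparison below covers the 5x5 split structure and the original structure, before and after unrolling, with the blur applied 3 times in a row:
As shown, split vs. unroll saves 0.401668 seconds in total, while original vs. unroll saves 0.088185 seconds.
The unrolled version of the original structure, however, ran in 0.985009 sec; surprisingly, that is slower than before…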
1.842012 seconds
3.015885 seconds
...
__m128i vg1 = _mm_loadu_si128((__m128i *)sse_g1);
__m128i vg2 = _mm_loadu_si128((__m128i *)sse_g2);
__m128i vg3 = _mm_loadu_si128((__m128i *)sse_g3);
__m128i vg4 = _mm_loadu_si128((__m128i *)sse_g4);
__m128i vg5 = _mm_loadu_si128((__m128i *)sse_g5);
__m128i vsum = _mm_set1_epi8(0),vtemplow = _mm_set1_epi8(0),vtemphigh = _mm_set1_epi8(0),vempty = _mm_set1_epi8(0);
// First element src[j*w+i]
// Load in data
__m128i L0 = _mm_loadu_si128((__m128i *)(src+(j+0)*w + i));
__m128i L1 = _mm_loadu_si128((__m128i *)(src+(j+1)*w + i));
__m128i L2 = _mm_loadu_si128((__m128i *)(src+(j+2)*w + i));
__m128i L3 = _mm_loadu_si128((__m128i *)(src+(j+3)*w + i));
__m128i L4 = _mm_loadu_si128((__m128i *)(src+(j+4)*w + i));
// Keep only the data we need (5 elements per row): unpacking the low half
// yields 8 elements, which is enough to cover those 5
__m128i v0 = _mm_unpacklo_epi8(L0,vk0);
__m128i v1 = _mm_unpacklo_epi8(L1,vk0);
__m128i v2 = _mm_unpacklo_epi8(L2,vk0);
__m128i v3 = _mm_unpacklo_epi8(L3,vk0);
__m128i v4 = _mm_unpacklo_epi8(L4,vk0);
// Multiply by the corresponding Gaussian coefficients
v0 = _mm_maddubs_epi16(v0,vg1);
v1 = _mm_maddubs_epi16(v1,vg2);
v2 = _mm_maddubs_epi16(v2,vg3);
v3 = _mm_maddubs_epi16(v3,vg4);
v4 = _mm_maddubs_epi16(v4,vg5);
// Sum the 5 rows
vsum = _mm_add_epi16(vsum,v0);
vsum = _mm_add_epi16(vsum,v1);
vsum = _mm_add_epi16(vsum,v2);
vsum = _mm_add_epi16(vsum,v3);
vsum = _mm_add_epi16(vsum,v4);
// Horizontal sum of vsum: widen the 16-bit lanes to 32 bits,
// then extract and add the first 5 lanes
vtemplow = _mm_unpacklo_epi16(vsum,vempty); // 1,2,3,4
vtemphigh = _mm_unpackhi_epi16(vsum,vempty); // 5
sum += _mm_cvtsi128_si32(vtemplow); // get 1
sum += _mm_cvtsi128_si32(_mm_srli_si128(vtemplow,4)); // get 2
sum += _mm_cvtsi128_si32(_mm_srli_si128(vtemplow,8)); // get 3
sum += _mm_cvtsi128_si32(_mm_srli_si128(vtemplow,12)); // get 4
sum += _mm_cvtsi128_si32(vtemphigh); // get 5
...
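0.402466 seconds (4 threads)!! A single load brings in 16 bytes, and the original data structure is RGBTRIPLE (3 bytes per pixel), so loading 5 pixels at once takes 5*3 = 15 bytes. Compared with the earlier approach, where only 5 of the 16 loaded bytes were used, this makes much better use of each load.
The data, originally laid out as BGRBGRBG…, is interleaved with an all-zero vector using _mm_unpacklo_epi8 to produce the pattern B0G0R0B0G0R0…; each __m128i variable thereby becomes two (from 5 to 10 in total). Next, _mm_maddubs_epi16 multiplies the transformed data by the correspondingly transformed Gaussian kernel matrix, and _mm_add_epi16 adds the results pairwise in sequence, leaving two __m128i variables that hold the sums of all terms. To be able to read bits [31:0] with _mm_cvtsi128_si32, we first use _mm_unpacklo_epi16 to expand [ Color ][ 0 ][ 0 ][ Color ][ 0 ][ 0 ]… into [ Color ][ 0 ][ 0 ][ 0 ][ 0 ][ Color ][ 0 ]… (operating in 32-bit units). Finally, _mm_cvtsi128_si32 combined with _mm_srli_si128 shifts extracts the interleaved R, G, and B values one by one; these are summed and stored back at the pixel position.
In memory order, these 15 bytes are laid out as B->G->R, repeating until the end.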
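The unpack, multiply, add, and extract sequence described above can be tried in isolation. The sketch below is a simplified stand-in, not the project's code: it works on a single-channel plane rather than interleaved BGR data, and it uses the SSE2 intrinsic `_mm_mullo_epi16` in place of `_mm_maddubs_epi16` (SSSE3) so that it builds without extra compiler flags, with a scalar fallback on targets without SSE2. The names `window_sum_5x5` and `window_sum_5x5_ref` are hypothetical, and the coefficients are assumed to be non-negative and small enough that each 16-bit lane sum does not overflow.

```c
#include <assert.h>
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Scalar reference: weighted sum of the 5x5 window whose top-left byte
 * is src[0]; k holds 25 row-major coefficients. */
int window_sum_5x5_ref(const uint8_t *src, int w, const int16_t *k)
{
    int sum = 0;
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 5; c++)
            sum += src[r * w + c] * k[r * 5 + c];
    return sum;
}

/* Same sum with SSE2 where available. Each row must have at least
 * 16 readable bytes, because a full 128-bit load is issued per row. */
int window_sum_5x5(const uint8_t *src, int w, const int16_t *k)
{
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i vsum = zero;
    for (int r = 0; r < 5; r++) {
        /* 5 coefficients, then 3 zeros so lanes 5..7 contribute nothing */
        __m128i vk = _mm_set_epi16(0, 0, 0, k[r * 5 + 4], k[r * 5 + 3],
                                   k[r * 5 + 2], k[r * 5 + 1], k[r * 5 + 0]);
        __m128i row = _mm_loadu_si128((const __m128i *)(src + r * w));
        /* interleave with zero: the low 8 bytes become 8 x 16-bit lanes */
        __m128i lo = _mm_unpacklo_epi8(row, zero);
        vsum = _mm_add_epi16(vsum, _mm_mullo_epi16(lo, vk));
    }
    /* widen 16-bit lanes to 32 bits, then pull out lanes 0..4 one by one */
    __m128i lo32 = _mm_unpacklo_epi16(vsum, zero); /* lanes 0..3 */
    __m128i hi32 = _mm_unpackhi_epi16(vsum, zero); /* lanes 4..7 */
    int sum = _mm_cvtsi128_si32(lo32);
    sum += _mm_cvtsi128_si32(_mm_srli_si128(lo32, 4));
    sum += _mm_cvtsi128_si32(_mm_srli_si128(lo32, 8));
    sum += _mm_cvtsi128_si32(_mm_srli_si128(lo32, 12));
    sum += _mm_cvtsi128_si32(hi32);
    return sum;
#else
    return window_sum_5x5_ref(src, w, k); /* no SSE2: scalar fallback */
#endif
}
```

Only 5 of the 16 loaded bytes per row are used here, which is exactly the inefficiency that the 15-byte RGBTRIPLE load described above is meant to avoid.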
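After implementing this, the change can be seen below:
The red curve (sse original) shows that it now basically outperforms most of the other implementations!!! (Hooray)
With this approach, the work that took three loop passes in the split version now takes a single pass (r, g, and b are processed together). This lets us use a specialized Gaussian kernel to multiply each shifted element; extracting and summing the results in order then gives the final value at that position!
Follow-up: adding prefetch did not make it any faster.
Experiment: interleave work across pthreads and examine the performance
There are two flips, horizontal and vertical; a horizontal flip is just the familiar string reversal applied to each row.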
Flip | naive(ori) | naive(tri) | openmp(tri) | sse(tri) |
Horizontal flip | 0.006459 | 0.016355 | 0.016314 | 0.004591 |
Vertical flip | 0.006776 | 0.022558 | 0.022308 | 0.001510 |
First convert RGB to HSV, raise or lower the S and V values, and then convert HSV back to RGB.
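The round trip above can be sketched as follows, using the standard RGB/HSV conversion formulas. Treating the adjustment as a multiplicative scale on S and V, clamped to [0, 1], is an assumption of this sketch; `hsv_t`, `rgb_to_hsv`, `hsv_to_rgb`, and `adjust_sv` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

/* h in degrees [0, 360], s and v in [0, 1] */
typedef struct { float h, s, v; } hsv_t;

static float clamp01(float x)
{
    return x < 0.0f ? 0.0f : (x > 1.0f ? 1.0f : x);
}

hsv_t rgb_to_hsv(uint8_t r8, uint8_t g8, uint8_t b8)
{
    float r = r8 / 255.0f, g = g8 / 255.0f, b = b8 / 255.0f;
    float max = r > g ? (r > b ? r : b) : (g > b ? g : b);
    float min = r < g ? (r < b ? r : b) : (g < b ? g : b);
    float d = max - min;
    hsv_t out;
    out.v = max;
    out.s = max > 0.0f ? d / max : 0.0f;
    if (d == 0.0f)
        out.h = 0.0f;                      /* grey: hue is arbitrary */
    else if (max == r)
        out.h = 60.0f * ((g - b) / d);
    else if (max == g)
        out.h = 60.0f * ((b - r) / d + 2.0f);
    else
        out.h = 60.0f * ((r - g) / d + 4.0f);
    if (out.h < 0.0f)
        out.h += 360.0f;
    return out;
}

void hsv_to_rgb(hsv_t in, uint8_t *r8, uint8_t *g8, uint8_t *b8)
{
    float hh = in.h / 60.0f;
    int sector = (int)hh;                  /* which 60-degree slice */
    float f = hh - (float)sector;          /* position inside the slice */
    float p = in.v * (1.0f - in.s);
    float q = in.v * (1.0f - in.s * f);
    float t = in.v * (1.0f - in.s * (1.0f - f));
    float r, g, b;
    switch (sector % 6) {
    case 0:  r = in.v; g = t;    b = p;    break;
    case 1:  r = q;    g = in.v; b = p;    break;
    case 2:  r = p;    g = in.v; b = t;    break;
    case 3:  r = p;    g = q;    b = in.v; break;
    case 4:  r = t;    g = p;    b = in.v; break;
    default: r = in.v; g = p;    b = q;    break;
    }
    *r8 = (uint8_t)(r * 255.0f + 0.5f);    /* round to nearest */
    *g8 = (uint8_t)(g * 255.0f + 0.5f);
    *b8 = (uint8_t)(b * 255.0f + 0.5f);
}

/* Scale saturation and brightness of one pixel in place (assumed policy). */
void adjust_sv(uint8_t *r, uint8_t *g, uint8_t *b, float s_scale, float v_scale)
{
    hsv_t c = rgb_to_hsv(*r, *g, *b);
    c.s = clamp01(c.s * s_scale);
    c.v = clamp01(c.v * v_scale);
    hsv_to_rgb(c, r, g, b);
}
```

A scale below 1 lowers saturation or brightness and a scale above 1 raises it; an additive offset would be an equally plausible reading of the adjustment described above.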