contributed by <ierosodin
>
reviewed by <janetwei
>
作業系統 : CentOS 7
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 12
On-line CPU(s) list: 0-11
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 45
Model name: Genuine Intel® CPU @ 3.30GHz
Stepping: 5
CPU MHz: 1277.976
BogoMIPS: 6600.19
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 15360K
NUMA node0 CPU(s): 0-11
$ yum install graphviz
$ yum install ImageMagick
$ git clone https://github.com/sysprog21/raytracing
$ cd raytracing
$ make
$ ./raytracing
初次執行raytracing
Execution time of raytracing() : 3.097945 sec
$ make PROFILE=1
$ ./raytracing
使用gprof時, 執行時間會變長
(gprof使gcc 在每個函数中都加入了一個mcount, 也就是說每個函數都會調用mcount, 增加執行時間)
Execution time of raytracing() : 5.379273 sec
gprof2dot能將gprof或perf的分析結果轉成 dot 格式, 裡面會描述各個節點間的關係
$ git clone https://github.com/jrfonseca/gprof2dot
$ gprof raytracing| ../gprof2dot/gprof2dot.py | dot -Tpng -o dot.png
$ gprof -b raytracing gmon.out | less
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
22.87 0.72 0.72 69646433 0.00 0.00 dot_product
19.85 1.35 0.63 56956357 0.00 0.00 subtract_vector
9.21 1.64 0.29 13861875 0.00 0.00 raySphereIntersection
8.26 1.90 0.26 10598450 0.00 0.00 normalize
7.94 2.15 0.25 13861875 0.00 0.00 rayRectangularIntersection
7.15 2.37 0.23 31410180 0.00 0.00 multiply_vector
6.04 2.56 0.19 17821809 0.00 0.00 cross_product
5.56 2.74 0.18 17836094 0.00 0.00 add_vector
3.18 2.84 0.10 4620625 0.00 0.00 ray_hit_object
從gprof調用表可以發現, dot_product與subtract_vector被呼叫次數高, 佔用了許多時間, 嘗試針對這兩項進行優化
針對dot_product()的for迴圈進行openmp
#include <omp.h>
#pragma omp for
for (i = 0; i < 3; i++)
dp += v1[i] * v2[i];
結果發現執行時間變長了!
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
36.36 1.33 1.33 69646433 0.00 0.00 dot_product
16.13 1.92 0.59 56956357 0.00 0.00 subtract_vector
11.21 2.33 0.41 13861875 0.00 0.00 rayRectangularIntersection
6.29 2.56 0.23 31410180 0.00 0.00 multiply_vector
5.19 2.75 0.19 10598450 0.00 0.00 normalize
4.92 2.93 0.18 17836094 0.00 0.00 add_vector
問題 : 應該是要減少呼叫dot_product的次數, 且dot_product中的for為小迴圈, 可嘗試將for迴圈展開
將dot_product, multiply_vector, add_vector, subtract_vector中的for迴圈展開
Execution time of raytracing() : 2.253459 sec
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
20.45 0.47 0.47 69646433 0.00 0.00 dot_product
11.96 0.75 0.28 13861875 0.00 0.00 rayRectangularIntersection
11.75 1.02 0.27 13861875 0.00 0.00 raySphereIntersection
10.44 1.26 0.24 10598450 0.00 0.00 normalize
7.40 1.43 0.17 31410180 0.00 0.00 multiply_vector
7.18 1.59 0.17 56956357 0.00 0.00 subtract_vector
6.96 1.75 0.16 17821809 0.00 0.00 cross_product
5.44 1.88 0.13 17836094 0.00 0.00 add_vector
從表中可以分析出, loop unrolling對效能產生了影響
發現rayRectangularIntersection與raySphereIntersection對程式效能有影響, 嘗試對raytracing.c進行優化
( 解決raytracing()中大量的for迴圈 -> 平行化 )
結果對效能有很明顯的提升
Execution time of raytracing() : 0.369776 sec
Flat profile:
Each sample counts as 0.01 seconds.
% cumulative self self total
time seconds seconds calls s/call s/call name
12.80 0.39 0.39 4570655 0.00 0.00 dot_product
11.14 0.72 0.34 3595377 0.00 0.00 subtract_vector
8.98 0.99 0.27 1661550 0.00 0.00 multiply_vector
8.64 1.25 0.26 911025 0.00 0.00 raySphereIntersection
8.31 1.50 0.25 587853 0.00 0.00 normalize
7.98 1.74 0.24 1159622 0.00 0.00 cross_product
7.98 1.98 0.24 231807 0.00 0.00 ray_hit_object
6.65 2.18 0.20 825606 0.00 0.00 rayRectangularIntersection
5.49 2.35 0.17 59468 0.00 0.00 ray_color
4.32 2.48 0.13 1 0.13 2.99 raytracing
3.49 2.58 0.11 1028861 0.00 0.00 add_vector
但是發現加入openmp後的結果不大理想, output出來的圖不正確
有看到學長使用pthreads, 不過因為自己對這方面不熟, 還沒有進行嘗試
不用提「這方面不熟」,沒有具體成果前,當然都是「不熟」 jserv
嘗試用pthreads改寫raytracing.c
參考吳彥寬的共筆
pthread_t *id = ( pthread_t* ) malloc( THREADNUM* sizeof( pthread_t));
宣告thread id, 並分配記憶體
rays** ptr = (rays**) malloc( THREADNUM* sizeof( rays* ));
宣告一個指標型態的陣列, 使每一個id都有一個指標型態的參數(ptr[i]), 用來傳入function
pthread_create( &id[i], NULL, (void*) &raytracing, (void*) ptr[i]);
建立thread, 參數分別是:
thread id; 屬性(一般填NULL); 要multi-threads的function; 傳入的參數
pthread_join( id[i], NULL);
用來等待該id的thread結束, 第二個參數用來儲存回傳值(如果有)
pthread_exit(0);
放在raytracing()最後, 表示該function結束後, thread自己關閉
Q:pthreads一個function似乎只能傳入一個參數
A:改變raytracing.c中raytracing()的參數結構 -> 變成一個structure
rays *new_rays(uint8_t *pixels, double *background_color,
rectangular_node rectangulars, sphere_node spheres,
light_node lights,const viewpoint *view,
int width, int height, int id, int threadnum)
{
rays * r = (rays *) malloc ( sizeof( rays));
r->pixels = pixels;
r->background_color = background_color;
r->rectangulars = rectangulars;
r->spheres = spheres;
r->lights = lights;
r->view = view;
r->width = width;
r->height = height;
r->id = id;
r->threadnum = threadnum;
return r;
}
並將原本的
void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height)
改成
void raytracing( void * ray)
函式中的參數也都改用potiner
(raytracing.h中也要做相對應的修改)
Q:編譯時出現
cc -o raytracing objects.o raytracing.o main.o -lm -lgomp
/usr/bin/ld: main.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5'
/usr/bin/ld: note: 'pthread_create@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line
/lib64/libpthread.so.0: could not read symbols: Invalid operation
collect2: error: ld returned 1 exit status
make: *** [raytracing] Error 1
A:需要在Makefile中的LDFLAGS
增加-lpthread
Execution time of raytracing() : 0.320049 sec
(THREADNUM = 128)
pthread能大幅提昇效能, 且得到的out.ppm也是正確的圖形
呼叫函數時, 電腦會紀錄目前的記憶體位址, 然後跳至函數的記憶體位置, 等到處理完後, 再回到原先的位址, 但這樣會降低效能
使用inline可以直接展開function, 但原本的inline只能'建議'compiler, 嘗試強迫inline
在CFLAGS
中增加-D__forceinline="__attribute__((always_inline))"
Execution time of raytracing() : 0.313396 sec
沒有很明顯的影響
gprof2dot github
pThreads for Raytracing
Enhance raytracing program
GNU gprof