# 2016q3 Homework 2 ( raytracing ) contributed by <`ierosodin`> reviewed by <`janetwei`> - 可以使用gnuplot之類的工具繪製出效能比較圖表,顯示優化前後差別 - 可以再嘗試其他優化方法,例如 SIMD ## 開發環境 作業系統 : CentOS 7 `$ lscpu` Architecture: x86_64 CPU op-mode(s): 32-bit, 64-bit Byte Order: Little Endian CPU(s): 12 On-line CPU(s) list: 0-11 Thread(s) per core: 2 Core(s) per socket: 6 Socket(s): 1 NUMA node(s): 1 Vendor ID: GenuineIntel CPU family: 6 Model: 45 Model name: Genuine Intel(R) CPU @ 3.30GHz Stepping: 5 CPU MHz: 1277.976 BogoMIPS: 6600.19 Virtualization: VT-x L1d cache: 32K L1i cache: 32K L2 cache: 256K L3 cache: 15360K NUMA node0 CPU(s): 0-11 ## 軟體安裝 ``` $ yum install graphviz $ yum install ImageMagick $ git clone https://github.com/sysprog21/raytracing $ cd raytracing $ make $ ./raytracing ``` 初次執行raytracing `Execution time of raytracing() : 3.097945 sec` ## 使用gprof分析 ``` $ make PROFILE=1 $ ./raytracing ``` 使用gprof時, 執行時間會變長 (gprof使gcc 在每個函数中都加入了一個mcount, 也就是說每個函數都會調用mcount, 增加執行時間) `Execution time of raytracing() : 5.379273 sec` ## Using gprof2dot in Centos gprof2dot能將gprof或perf的分析結果轉成 dot 格式, 裡面會描述各個節點間的關係 ``` $ git clone https://github.com/jrfonseca/gprof2dot $ gprof raytracing| ../gprof2dot/gprof2dot.py | dot -Tpng -o dot.png ``` ![](https://i.imgur.com/8il9B1s.png) ### 開始分析 `$ gprof -b raytracing gmon.out | less` ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 22.87 0.72 0.72 69646433 0.00 0.00 dot_product 19.85 1.35 0.63 56956357 0.00 0.00 subtract_vector 9.21 1.64 0.29 13861875 0.00 0.00 raySphereIntersection 8.26 1.90 0.26 10598450 0.00 0.00 normalize 7.94 2.15 0.25 13861875 0.00 0.00 rayRectangularIntersection 7.15 2.37 0.23 31410180 0.00 0.00 multiply_vector 6.04 2.56 0.19 17821809 0.00 0.00 cross_product 5.56 2.74 0.18 17836094 0.00 0.00 add_vector 3.18 2.84 0.10 4620625 0.00 0.00 ray_hit_object ``` 從gprof調用表可以發現, dot_product與subtract_vector被呼叫次數高, 佔用了許多時間, 嘗試針對這兩項進行優化 ## 嘗試一( OpenMP ) 針對dot_product()的for迴圈進行openmp ``` #include <omp.h> #pragma omp for for (i = 0; i < 3; i++) dp += v1[i] * v2[i]; ``` 結果發現執行時間變長了! ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 36.36 1.33 1.33 69646433 0.00 0.00 dot_product 16.13 1.92 0.59 56956357 0.00 0.00 subtract_vector 11.21 2.33 0.41 13861875 0.00 0.00 rayRectangularIntersection 6.29 2.56 0.23 31410180 0.00 0.00 multiply_vector 5.19 2.75 0.19 10598450 0.00 0.00 normalize 4.92 2.93 0.18 17836094 0.00 0.00 add_vector ``` 問題 : 應該是要減少呼叫dot_product的次數, 且dot_product中的for為小迴圈, 可嘗試將for迴圈展開 ## 嘗試二( loop unrolling ) 將dot_product, multiply_vector, add_vector, subtract_vector中的for迴圈展開 `Execution time of raytracing() : 2.253459 sec` ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 20.45 0.47 0.47 69646433 0.00 0.00 dot_product 11.96 0.75 0.28 13861875 0.00 0.00 rayRectangularIntersection 11.75 1.02 0.27 13861875 0.00 0.00 raySphereIntersection 10.44 1.26 0.24 10598450 0.00 0.00 normalize 7.40 1.43 0.17 31410180 0.00 0.00 multiply_vector 7.18 1.59 0.17 56956357 0.00 0.00 subtract_vector 6.96 1.75 0.16 17821809 0.00 0.00 cross_product 5.44 1.88 0.13 17836094 0.00 0.00 add_vector ``` 從表中可以分析出, loop unrolling對效能產生了影響 ## 嘗試三( OpenMP ) 發現rayRectangularIntersection與raySphereIntersection對程式效能有影響, 嘗試對raytracing.c進行優化 ( 解決raytracing()中大量的for迴圈 -> 平行化 ) 結果對效能有很明顯的提升 `Execution time of raytracing() : 0.369776 sec` ``` Flat profile: Each sample counts as 0.01 seconds. % cumulative self self total time seconds seconds calls s/call s/call name 12.80 0.39 0.39 4570655 0.00 0.00 dot_product 11.14 0.72 0.34 3595377 0.00 0.00 subtract_vector 8.98 0.99 0.27 1661550 0.00 0.00 multiply_vector 8.64 1.25 0.26 911025 0.00 0.00 raySphereIntersection 8.31 1.50 0.25 587853 0.00 0.00 normalize 7.98 1.74 0.24 1159622 0.00 0.00 cross_product 7.98 1.98 0.24 231807 0.00 0.00 ray_hit_object 6.65 2.18 0.20 825606 0.00 0.00 rayRectangularIntersection 5.49 2.35 0.17 59468 0.00 0.00 ray_color 4.32 2.48 0.13 1 0.13 2.99 raytracing 3.49 2.58 0.11 1028861 0.00 0.00 add_vector ``` 但是發現加入openmp後的結果不大理想, output出來的圖不正確 ![](https://i.imgur.com/1VmrQ6y.png) ## 其他嘗試( pthreads ) 有看到學長使用pthreads, 不過因為自己對這方面不熟, 還沒有進行嘗試 >> 不用提「這方面不熟」,沒有具體成果前,當然都是「不熟」 [name=jserv] 嘗試用pthreads改寫raytracing.c >>參考[吳彥寬的共筆](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) ### 使用到的pthreads函式 #### main.c ```clike pthread_t *id = ( pthread_t* ) malloc( THREADNUM* sizeof( pthread_t)); ``` 宣告thread id, 並分配記憶體 ```clike rays** ptr = (rays**) malloc( THREADNUM* sizeof( rays* )); ``` 宣告一個指標型態的陣列, 使每一個id都有一個指標型態的參數(ptr[i]), 用來傳入function ```clike pthread_create( &id[i], NULL, (void*) &raytracing, (void*) ptr[i]); ``` 建立thread, 參數分別是: thread id; 屬性(一般填NULL); 要multi-threads的function; 傳入的參數 ```clike pthread_join( id[i], NULL); ``` 用來等待該id的thread結束, 第二個參數用來儲存回傳值(如果有) #### raytracing.c ```clike pthread_exit(0); ``` 放在raytracing()最後, 表示該function結束後, thread自己關閉 ### 問題解決 Q:pthreads一個function似乎只能傳入一個參數 A:改變raytracing.c中raytracing()的參數結構 -> 變成一個structure ``` rays *new_rays(uint8_t *pixels, double *background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights,const viewpoint *view, int width, int height, int id, int threadnum) { rays * r = (rays *) malloc ( sizeof( rays)); r->pixels = pixels; r->background_color = background_color; r->rectangulars = rectangulars; r->spheres = spheres; r->lights = lights; r->view = view; r->width = width; r->height = height; r->id = id; r->threadnum = threadnum; return r; } ``` 並將原本的 `void raytracing(uint8_t *pixels, color background_color, rectangular_node rectangulars, sphere_node spheres, light_node lights, const viewpoint *view, int width, int height)` 改成 `void raytracing( void * ray)` 函式中的參數也都改用potiner (raytracing.h中也要做相對應的修改) Q:編譯時出現 ``` cc -o raytracing objects.o raytracing.o main.o -lm -lgomp /usr/bin/ld: main.o: undefined reference to symbol 'pthread_create@@GLIBC_2.2.5' /usr/bin/ld: note: 'pthread_create@@GLIBC_2.2.5' is defined in DSO /lib64/libpthread.so.0 so try adding it to the linker command line /lib64/libpthread.so.0: could not read symbols: Invalid operation collect2: error: ld returned 1 exit status make: *** [raytracing] Error 1 ``` A:需要在Makefile中的`LDFLAGS`增加`-lpthread` ### 結果 `Execution time of raytracing() : 0.320049 sec`(THREADNUM = 128) pthread能大幅提昇效能, 且得到的out.ppm也是正確的圖形 ## 嘗試四( force inline ) 呼叫函數時, 電腦會紀錄目前的記憶體位址, 然後跳至函數的記憶體位置, 等到處理完後, 再回到原先的位址, 但這樣會降低效能 使用inline可以直接展開function, 但原本的inline只能'建議'compiler, 嘗試強迫inline 在`CFLAGS`中增加`-D__forceinline="__attribute__((always_inline))"` ### 結果 `Execution time of raytracing() : 0.313396 sec` 沒有很明顯的影響 ## 參考資料 [gprof2dot github](https://github.com/jrfonseca/gprof2dot) [pThreads for Raytracing](https://embedded2016.hackpad.com/ep/pad/static/wOu40KzMaIP) [Enhance raytracing program](https://embedded2016.hackpad.com/ep/pad/static/f5CCUGMQ4Kp) [GNU gprof](https://sourceware.org/binutils/docs/gprof/)