# 2017q1 Homework1 (raytracing)
contributed by <`yih6208`>
### Reviewed by `claaaaassic`
* The GitHub link in [2017q1 Homework1 (作業區)](https://hackmd.io/s/Hy2rieDYe) is invalid; also, consider linking to the published version of the shared notes so they look consistent
* In commit [4c07205b0fb533937efddbe9174058991b6d3ed6](https://github.com/yih6208/raytracing/commit/4c07205b0fb533937efddbe9174058991b6d3ed6), the newly added code does not match the surrounding style and contains redundant blank lines
* Consider using `__attribute__((always_inline))` to speed up execution
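A sketch of the reviewer's last suggestion; the `add_vector()` signature is an assumption based on the helper names in the gprof table below, not taken from the repo:

```c
/* Sketch: __attribute__((always_inline)) forces GCC/Clang to inline
 * this small vector helper at every call site, even when the normal
 * inlining heuristics would decline (e.g. at low optimization levels).
 * The signature is assumed, mirroring subtract_vector() shown later. */
static inline __attribute__((always_inline))
void add_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];
}
```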
## First Step
### gprof
#### Following the tutorial, I used gprof to analyze the performance of the original program.
`$ gprof -b a.out gmon.out | less `
```shell=
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 25.70      0.47     0.47  56956357     0.00     0.00  subtract_vector
 23.51      0.90     0.43  69646433     0.00     0.00  dot_product
  9.02      1.07     0.17  31410180     0.00     0.00  multiply_vector
  7.11      1.20     0.13  13861875     0.00     0.00  rayRectangularIntersection
  7.11      1.33     0.13  10598450     0.00     0.00  normalize
  6.56      1.45     0.12  17836094     0.00     0.00  add_vector
  6.01      1.56     0.11  17821809     0.00     0.00  cross_product
  4.37      1.64     0.08  13861875     0.00     0.00  raySphereIntersection
  3.83      1.71     0.07   4620625     0.00     0.00  ray_hit_object
  3.01      1.76     0.06   4221152     0.00     0.00  multiply_vectors
  1.64      1.79     0.03   1048576     0.00     0.00  ray_color
  1.09      1.81     0.02   1241598     0.00     0.00  reflection
  0.55      1.82     0.01   2110576     0.00     0.00  compute_specular_diffuse
  0.55      1.83     0.01   1241598     0.00     0.00  refraction
  0.00      1.83     0.00   3838091     0.00     0.00  length
  0.00      1.83     0.00   2558386     0.00     0.00  idx_stack_empty
  0.00      1.83     0.00   2520791     0.00     0.00  idx_stack_top
  0.00      1.83     0.00   2110576     0.00     0.00  localColor
  0.00      1.83     0.00   1241598     0.00     0.00  protect_color_overflow
  0.00      1.83     0.00   1204003     0.00     0.00  idx_stack_push
  0.00      1.83     0.00   1048576     0.00     0.00  idx_stack_init
  0.00      1.83     0.00   1048576     0.00     0.00  rayConstruction
  0.00      1.83     0.00    113297     0.00     0.00  fresnel
  0.00      1.83     0.00     37595     0.00     0.00  idx_stack_pop
  0.00      1.83     0.00         3     0.00     0.00  append_rectangular
  0.00      1.83     0.00         3     0.00     0.00  append_sphere
  0.00      1.83     0.00         2     0.00     0.00  append_light
  0.00      1.83     0.00         1     0.00     0.00  calculateBasisVectors
  0.00      1.83     0.00         1     0.00     0.00  delete_light_list
  0.00      1.83     0.00         1     0.00     0.00  delete_rectangular_list
  0.00      1.83     0.00         1     0.00     0.00  delete_sphere_list
  0.00      1.83     0.00         1     0.00     0.00  diff_in_second
  0.00      1.83     0.00         1     0.00     1.83  raytracing
  0.00      1.83     0.00         1     0.00     0.00  write_to_ppm
```
According to the table above, the functions the program spends the most time in are `subtract_vector` and `dot_product`. Together they account for about 50% of the execution time, so to improve performance significantly I should focus on these functions first.
## Second Step
### Improve
#### subtract_vector() & dot_product()
```C=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}
```
~~I think this function could be parallelized; I could use threads to improve its speed.~~
I found that I should apply loop unrolling first.
```c=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}
```
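The section heading also names `dot_product()`; the same unrolling applied there might look like the sketch below. The signature is an assumption (mirroring `subtract_vector()` above), not copied from the repo:

```c
/* Sketch: manual unrolling of dot_product() for 3-element vectors.
 * The loop is replaced by straight-line code, removing the loop
 * branch entirely.  Signature assumed, not taken from the repo. */
static inline double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```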
Loop unrolling costs more code space and hurts readability, but it removes the loop branch, so the CPU's branch prediction becomes more accurate. That is why it reduces execution time.
The original execution time is 2.233555 sec.
The optimized one is 1.522846 sec.
**It takes only 68% of the original time after this optimization.**
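For reference, the timings above come from wall-clock measurement. A sketch of a `diff_in_second()` helper, matching the function name in the gprof table (its exact body in the repo is assumed):

```c
#include <time.h>

/* Sketch: wall-clock difference between two clock_gettime() samples,
 * in seconds.  The name matches diff_in_second in the gprof table;
 * the body here is an assumption, not copied from the repo. */
static double diff_in_second(struct timespec t1, struct timespec t2)
{
    struct timespec diff;
    if (t2.tv_nsec - t1.tv_nsec < 0) {
        /* borrow one second when the nanosecond part underflows */
        diff.tv_sec = t2.tv_sec - t1.tv_sec - 1;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec + 1000000000;
    } else {
        diff.tv_sec = t2.tv_sec - t1.tv_sec;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec;
    }
    return diff.tv_sec + diff.tv_nsec / 1000000000.0;
}
```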
## Third Step
After reading raytracing.c, I noticed that the three-level nested for loop in raytracing() can be parallelized.
```c=
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view, width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color, MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
```
### OpenMP - parallel for directive
I decided to use OpenMP to parallelize it.
At first, I only used the directive below:
```c=
/* In void raytracing() */
#pragma omp parallel for
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
After this change, the output was wrong.
I suspected there was a critical section, or variables shared between threads that I had failed to protect.

### OpenMP - private variable clause
I searched for a solution and found that I need to declare some variables as `private`, which makes OpenMP give each thread its own copy of the variable to avoid race conditions.
```c=
/* In void raytracing() */
#pragma omp parallel for private(stk, d, object_color)
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
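A minimal, self-contained sketch of what `private` does; this is a toy function, not code from the raytracer:

```c
/* Toy demonstration of private(): each thread gets its own copy of
 * tmp, so no thread observes another thread's intermediate value.
 * Compile with -fopenmp; without it the pragma is simply ignored
 * and the loop runs serially with the same result. */
void fill_squares(int *out, int n)
{
    int tmp;
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < n; i++) {
        tmp = i * i;   /* without private(tmp), threads would race on tmp */
        out[i] = tmp;
    }
}
```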
Reference: http://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
But there was still a problem.

### OpenMP - Optimize nested for-loop
I finally succeeded. I used `collapse(2)`, which tells OpenMP to collapse the two-level nested loop into a single iteration space; by default, OpenMP only parallelizes the outermost loop.
```c=
/* In void raytracing() */
#pragma omp parallel for collapse(2) private(stk, d, object_color)
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
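A toy sketch of `collapse(2)` on a pixel-like 2-D loop; the function and names here are hypothetical, not from the raytracer:

```c
/* Toy demonstration of collapse(2): OpenMP fuses both loops into one
 * iteration space of height*width iterations, so work is balanced
 * across threads even when the outer loop alone has few iterations.
 * Compile with -fopenmp; without it the loop runs serially. */
void shade(unsigned char *pixels, int width, int height)
{
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            pixels[j * width + i] = (unsigned char) ((i + j) & 0xff);
}
```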

Finally, the output is correct!
```shell=
Rendering scene
Done!
Execution time of raytracing() : 0.646787 sec
```
### OpenMP - Schedule tuning

References:
* http://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
* http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html
**I noticed that the schedule clause affects parallel performance, so I started to test different scheduling policies.**
I ran each mode 10 times to get an average. The times for the `guided` and `dynamic` modes are close.

So, I ran these two modes 100 times each to get an average.

We can observe that `guided` mode takes slightly less time than `dynamic` mode.
The original execution time is 2.233555 sec.
The loop-unrolling version takes 1.522846 sec.
The OpenMP version takes 0.3617023 sec.
**It takes only 16.2% of the original time after optimization.**
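A toy sketch of how a schedule clause changes work distribution when per-iteration costs are uneven, as they are in ray tracing (some rays bounce more than others); the function is hypothetical:

```c
/* Toy demonstration of schedule(guided): the inner loop's cost grows
 * with i, so a static split would leave some threads idle while
 * others finish the expensive tail.  guided (or dynamic) scheduling
 * hands out shrinking chunks at runtime to rebalance the work.
 * Compile with -fopenmp; without it the loop runs serially. */
long weighted_sum(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+ : total)
    for (int i = 0; i < n; i++) {
        long w = 0;
        for (int k = 0; k < i; k++)   /* cost grows with i: uneven workload */
            w += k;
        total += w;
    }
    return total;
}
```

In the raytracer, the same clause would simply be appended to the existing directive, e.g. `#pragma omp parallel for collapse(2) private(stk, d, object_color) schedule(guided)`.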