# 2017q1 Homework1 (raytracing)
contributed by <`yih6208`>
### Reviewed by `claaaaassic`
* The GitHub link in [2017q1 Homework1 (作業區)](https://hackmd.io/s/Hy2rieDYe) is invalid; also, consider linking to the published version of the shared notes so they look consistent
* In commit [4c07205b0fb533937efddbe9174058991b6d3ed6](https://github.com/yih6208/raytracing/commit/4c07205b0fb533937efddbe9174058991b6d3ed6), the newly added code does not match the surrounding style and contains redundant blank lines
* Consider using `__attribute__((always_inline))` to speed up execution
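A sketch of the reviewer's last suggestion; the `add_vector()` signature is an assumption based on the helper names in the gprof table below, not taken from the repo:

```c
/* Sketch: __attribute__((always_inline)) forces GCC/Clang to inline
 * this small vector helper at every call site, even when the normal
 * inlining heuristics would decline (e.g. at low optimization levels).
 * The signature is assumed, mirroring subtract_vector() shown later. */
static inline __attribute__((always_inline))
void add_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] + b[i];
}
```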
## First Step
### gprof
#### Following the tutorial, I used gprof to analyze the performance of the original program.
`$ gprof -b a.out gmon.out | less `
```shell=
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds     calls   s/call   s/call  name
 25.70      0.47     0.47  56956357     0.00     0.00  subtract_vector
 23.51      0.90     0.43  69646433     0.00     0.00  dot_product
  9.02      1.07     0.17  31410180     0.00     0.00  multiply_vector
  7.11      1.20     0.13  13861875     0.00     0.00  rayRectangularIntersection
  7.11      1.33     0.13  10598450     0.00     0.00  normalize
  6.56      1.45     0.12  17836094     0.00     0.00  add_vector
  6.01      1.56     0.11  17821809     0.00     0.00  cross_product
  4.37      1.64     0.08  13861875     0.00     0.00  raySphereIntersection
  3.83      1.71     0.07   4620625     0.00     0.00  ray_hit_object
  3.01      1.76     0.06   4221152     0.00     0.00  multiply_vectors
  1.64      1.79     0.03   1048576     0.00     0.00  ray_color
  1.09      1.81     0.02   1241598     0.00     0.00  reflection
  0.55      1.82     0.01   2110576     0.00     0.00  compute_specular_diffuse
  0.55      1.83     0.01   1241598     0.00     0.00  refraction
  0.00      1.83     0.00   3838091     0.00     0.00  length
  0.00      1.83     0.00   2558386     0.00     0.00  idx_stack_empty
  0.00      1.83     0.00   2520791     0.00     0.00  idx_stack_top
  0.00      1.83     0.00   2110576     0.00     0.00  localColor
  0.00      1.83     0.00   1241598     0.00     0.00  protect_color_overflow
  0.00      1.83     0.00   1204003     0.00     0.00  idx_stack_push
  0.00      1.83     0.00   1048576     0.00     0.00  idx_stack_init
  0.00      1.83     0.00   1048576     0.00     0.00  rayConstruction
  0.00      1.83     0.00    113297     0.00     0.00  fresnel
  0.00      1.83     0.00     37595     0.00     0.00  idx_stack_pop
  0.00      1.83     0.00         3     0.00     0.00  append_rectangular
  0.00      1.83     0.00         3     0.00     0.00  append_sphere
  0.00      1.83     0.00         2     0.00     0.00  append_light
  0.00      1.83     0.00         1     0.00     0.00  calculateBasisVectors
  0.00      1.83     0.00         1     0.00     0.00  delete_light_list
  0.00      1.83     0.00         1     0.00     0.00  delete_rectangular_list
  0.00      1.83     0.00         1     0.00     0.00  delete_sphere_list
  0.00      1.83     0.00         1     0.00     0.00  diff_in_second
  0.00      1.83     0.00         1     0.00     1.83  raytracing
  0.00      1.83     0.00         1     0.00     0.00  write_to_ppm
```
According to the table above, the functions the program spends the most time in are `subtract_vector` and `dot_product`. Together they account for about 50% of the execution time, so to improve performance significantly I should focus on these functions first.
## Second Step
### Improve
#### subtract_vector() & dot_product()
```C=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    for (int i = 0; i < 3; i++)
        out[i] = a[i] - b[i];
}
```
~~I think this function could be parallelized; I could use threads to improve its speed.~~
I found that I should apply loop unrolling first.
```c=
static inline
void subtract_vector(const double *a, const double *b, double *out)
{
    out[0] = a[0] - b[0];
    out[1] = a[1] - b[1];
    out[2] = a[2] - b[2];
}
```
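The section heading also names `dot_product()`; the same unrolling applied there might look like the sketch below. The signature is an assumption (mirroring `subtract_vector()` above), not copied from the repo:

```c
/* Sketch: manual unrolling of dot_product() for 3-element vectors.
 * The loop is replaced by straight-line code, removing the loop
 * branch entirely.  Signature assumed, not taken from the repo. */
static inline double dot_product(const double *v1, const double *v2)
{
    return v1[0] * v2[0] + v1[1] * v2[1] + v1[2] * v2[2];
}
```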
Loop unrolling costs more code space and hurts readability, but it removes the loop branch, so the CPU's branch prediction becomes more accurate. That is why it reduces execution time.
The original execution time is 2.233555 sec.
The optimized one is 1.522846 sec.
**It takes only 68% of the original time after this optimization.**
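For reference, the timings above come from wall-clock measurement. A sketch of a `diff_in_second()` helper, matching the function name in the gprof table (its exact body in the repo is assumed):

```c
#include <time.h>

/* Sketch: wall-clock difference between two clock_gettime() samples,
 * in seconds.  The name matches diff_in_second in the gprof table;
 * the body here is an assumption, not copied from the repo. */
static double diff_in_second(struct timespec t1, struct timespec t2)
{
    struct timespec diff;
    if (t2.tv_nsec - t1.tv_nsec < 0) {
        /* borrow one second when the nanosecond part underflows */
        diff.tv_sec = t2.tv_sec - t1.tv_sec - 1;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec + 1000000000;
    } else {
        diff.tv_sec = t2.tv_sec - t1.tv_sec;
        diff.tv_nsec = t2.tv_nsec - t1.tv_nsec;
    }
    return diff.tv_sec + diff.tv_nsec / 1000000000.0;
}
```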
## Third Step
After reading raytracing.c, I noticed that the three-level nested for loop in raytracing() can be parallelized.
```c=
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        double r = 0, g = 0, b = 0;
        /* MSAA */
        for (int s = 0; s < SAMPLES; s++) {
            idx_stack_init(&stk);
            rayConstruction(d, u, v, w,
                            i * factor + s / factor,
                            j * factor + s % factor,
                            view, width * factor, height * factor);
            if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                          lights, object_color, MAX_REFLECTION_BOUNCES)) {
                r += object_color[0];
                g += object_color[1];
                b += object_color[2];
            } else {
                r += background_color[0];
                g += background_color[1];
                b += background_color[2];
            }
            pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
            pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
        }
    }
}
```
### OpenMP - parallel for directive
I decided to use OpenMP to parallelize it.
At first, I only used the directive below:
```c=
/* In void raytracing() */
#pragma omp parallel for
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
After this change, the output was wrong.
I suspected there was a critical section, or variables shared between threads that I had failed to protect.

### OpenMP - private variable clause
I searched for a solution and found that I need to declare some variables as `private`, which makes OpenMP give each thread its own copy of the variable to avoid race conditions.
```c=
/* In void raytracing() */
#pragma omp parallel for private(stk, d, object_color)
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
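A minimal, self-contained sketch of what `private` does; this is a toy function, not code from the raytracer:

```c
/* Toy demonstration of private(): each thread gets its own copy of
 * tmp, so no thread observes another thread's intermediate value.
 * Compile with -fopenmp; without it the pragma is simply ignored
 * and the loop runs serially with the same result. */
void fill_squares(int *out, int n)
{
    int tmp;
    #pragma omp parallel for private(tmp)
    for (int i = 0; i < n; i++) {
        tmp = i * i;   /* without private(tmp), threads would race on tmp */
        out[i] = tmp;
    }
}
```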
Reference: http://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
But there was still a problem.

### OpenMP - Optimize nested for-loop
I finally succeeded. I used `collapse(2)`, which tells OpenMP to collapse the two-level nested loop into a single iteration space; by default, OpenMP only parallelizes the outermost loop.
```c=
/* In void raytracing() */
#pragma omp parallel for collapse(2) private(stk, d, object_color)
for (j = 0; j < height; j++) {
    for (i = 0; i < width; i++) {
        ...
```
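A toy sketch of `collapse(2)` on a pixel-like 2-D loop; the function and names here are hypothetical, not from the raytracer:

```c
/* Toy demonstration of collapse(2): OpenMP fuses both loops into one
 * iteration space of height*width iterations, so work is balanced
 * across threads even when the outer loop alone has few iterations.
 * Compile with -fopenmp; without it the loop runs serially. */
void shade(unsigned char *pixels, int width, int height)
{
    #pragma omp parallel for collapse(2)
    for (int j = 0; j < height; j++)
        for (int i = 0; i < width; i++)
            pixels[j * width + i] = (unsigned char) ((i + j) & 0xff);
}
```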

Finally, the output is correct!
```shell=
Rendering scene
Done!
Execution time of raytracing() : 0.646787 sec
```
### OpenMP - Schedule tuning

References:
* http://www.openmp.org/wp-content/uploads/omp-hands-on-SC08.pdf
* http://pages.tacc.utexas.edu/~eijkhout/pcse/html/omp-loop.html
**I noticed that the schedule clause affects parallel performance, so I started to test different scheduling policies.**
I ran each mode 10 times to get an average. The times for the `guided` and `dynamic` modes are close.

So, I ran these two modes 100 times each to get an average.

We can observe that `guided` mode takes slightly less time than `dynamic` mode.
The original execution time is 2.233555 sec.
The loop-unrolling version takes 1.522846 sec.
The OpenMP version takes 0.3617023 sec.
**It takes only 16.2% of the original time after optimization.**
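A toy sketch of how a schedule clause changes work distribution when per-iteration costs are uneven, as they are in ray tracing (some rays bounce more than others); the function is hypothetical:

```c
/* Toy demonstration of schedule(guided): the inner loop's cost grows
 * with i, so a static split would leave some threads idle while
 * others finish the expensive tail.  guided (or dynamic) scheduling
 * hands out shrinking chunks at runtime to rebalance the work.
 * Compile with -fopenmp; without it the loop runs serially. */
long weighted_sum(int n)
{
    long total = 0;
    #pragma omp parallel for schedule(guided) reduction(+ : total)
    for (int i = 0; i < n; i++) {
        long w = 0;
        for (int k = 0; k < i; k++)   /* cost grows with i: uneven workload */
            w += k;
        total += w;
    }
    return total;
}
```

In the raytracer, the same clause would simply be appended to the existing directive, e.g. `#pragma omp parallel for collapse(2) private(stk, d, object_color) schedule(guided)`.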