Try   HackMD

2016q3 Homework1 (raytracing)

tags: heathcliffYang

contributed by <heathcliffYang>

Implementation note

I had better to read the Makefile first.

make check will verify the picture.
convert can convert the ppm to png or else.
make PROFILE=1 -> compile the program needed by gprof in Makefile.
make clean will remove use-models.h,out.ppm and gmon.out.
gprof -b ./raytracing | less -> 從上方顯示較少資訊,並提供翻頁,-b是不需要解釋名詞
gprof ./raytracing | gprof2dot | dot -T png -o callgraph_omp1.png

Try & Problems

9/26 8 pm

1. orig execution time of raytracing()

​6.232054 sec

2. Rerun for gprof to collect data

  • gprof
heathcliff@heathcliff-MacBookPro:~/raytracing$ gprof -b ./raytracing
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 20.34      0.50     0.50 69646433     0.00     0.00  dot_product
 16.68      0.91     0.41 56956357     0.00     0.00  subtract_vector
 13.42      1.24     0.33 10598450     0.00     0.00  normalize
 11.80      1.53     0.29 31410180     0.00     0.00  multiply_vector
  9.36      1.76     0.23 13861875     0.00     0.00  rayRectangularIntersection
  6.91      1.93     0.17 13861875     0.00     0.00  raySphereIntersection
  4.47      2.04     0.11  4620625     0.00     0.00  ray_hit_object
  3.46      2.13     0.09 17836094     0.00     0.00  add_vector
  2.44      2.19     0.06 17821809     0.00     0.00  cross_product
  2.03      2.24     0.05  1048576     0.00     0.00  ray_color
  1.63      2.28     0.04  4221152     0.00     0.00  multiply_vectors
  1.63      2.32     0.04  2110576     0.00     0.00  compute_specular_diffuse
  1.22      2.35     0.03  2520791     0.00     0.00  idx_stack_top
  1.22      2.38     0.03  2110576     0.00     0.00  localColor
  0.81      2.40     0.02  1241598     0.00     0.00  refraction
  • perf
 Performance counter stats for './raytracing' (10 runs):

            78,085      cache-misses              #   19.826 % of all cache refs      ( +- 11.39% )
           393,846      cache-references                                              ( +-  3.41% )
    34,500,667,719      instructions              #    1.75  insns per cycle          ( +-  0.00% )
    19,734,977,254      cycles                                                        ( +-  0.83% )

       6.453942844 seconds time elapsed                                          ( +-  0.82% )

Loop unrolling - 9/27 7pm

loop unrolling 所有的

1.

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 16.57      0.27     0.27 13861875     0.00     0.00  rayRectangularIntersection
 14.70      0.50     0.24 56956357     0.00     0.00  subtract_vector ***
 13.13      0.71     0.21 69646433     0.00     0.00  dot_product     ***
 11.88      0.90     0.19 10598450     0.00     0.00  normalize       ***
  8.44      1.04     0.14  4620625     0.00     0.00  ray_hit_object
  7.50      1.16     0.12 13861875     0.00     0.00  raySphereIntersection
  6.88      1.27     0.11 17836094     0.00     0.00  add_vector      ***
  5.32      1.35     0.09 17821809     0.00     0.00  cross_product   ***
  3.13      1.40     0.05  3838091     0.00     0.00  length          ***
  2.19      1.44     0.04 31410180     0.00     0.00  multiply_vector ***
  1.88      1.47     0.03  2110576     0.00     0.00  compute_specular_diffuse
  1.88      1.50     0.03  1048576     0.00     0.00  ray_color
  1.56      1.52     0.03  4221152     0.00     0.00  multiply_vectors***
  0.94      1.54     0.02   113297     0.00     0.00  fresnel
  0.94      1.55     0.02  2520791     0.00     0.00  idx_stack_top
  0.63      1.56     0.01  2110576     0.00     0.00  localColor
  0.63      1.57     0.01  1241598     0.00     0.00  protect_color_overflow
  0.63      1.58     0.01  1241598     0.00     0.00  reflection
  0.63      1.59     0.01        1     0.01     0.01  delete_sphere_list
  0.63      1.60     0.01        1     0.01     1.59  raytracing
  0.00      1.60     0.00  2558386     0.00     0.00  idx_stack_empty
  0.00      1.60     0.00  1241598     0.00     0.00  refraction
  0.00      1.60     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.60     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.60     0.00  1048576     0.00     0.00  rayConstruction
  0.00      1.60     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.60     0.00        3     0.00     0.00  append_rectangular
  0.00      1.60     0.00        3     0.00     0.00  append_sphere
  0.00      1.60     0.00        2     0.00     0.00  append_light
  0.00      1.60     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.60     0.00        1     0.00     0.00  delete_light_list
  0.00      1.60     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.60     0.00        1     0.00     0.00  diff_in_second
  0.00      1.60     0.00        1     0.00     0.00  write_to_ppm

2. perf

 Performance counter stats for './raytracing' (10 runs):

            59,245      cache-misses              #   19.567 % of all cache refs      ( +-  1.89% )
           302,776      cache-references                                              ( +-  7.82% )
    25,746,948,831      instructions              #    1.61  insns per cycle          ( +-  0.00% )
    15,999,672,351      cycles                                                        ( +-  0.43% )

       5.210404034 seconds time elapsed                                          ( +-  0.56% )

OpenMP - 針對迴圈平行化 9/27 9pm

1. Code

void normalize(double *v)
{
    double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    assert(d != 0.0 && "Error calculating normal");
	
    #pragma omp parallel for
    for (a=0; a < 3; a++) {
        v[a]/=d;
    }

}

2. 遇到的問題

嘗試把所有math-toolkit.h裡的函式可以平行化的迴圈都平行化了,但是反而更慢更久;
之後換成只改其中一個函式,也跑得很久,這兩個都沒等到跑完。

強迫改變函式屬性為always_inline - 9/27 11 pm

all

1. Code

static __attribute__((always_inline))
void normalize(double *v)
{
    double d = sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    assert(d != 0.0 && "Error calculating normal");

    v[0] /= d;
    v[1] /= d;
    v[2] /= d;
}

2. gprof

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 30.20      0.47     0.47 13861875     0.00     0.00  rayRectangularIntersection
 15.91      0.71     0.25 13861875     0.00     0.00  raySphereIntersection
 14.29      0.93     0.22  2110576     0.00     0.00  compute_specular_diffuse
  9.74      1.08     0.15  1048576     0.00     0.00  ray_color
  9.09      1.22     0.14  2110576     0.00     0.00  localColor
  6.82      1.33     0.11  4620625     0.00     0.00  ray_hit_object
  5.85      1.42     0.09  1048576     0.00     0.00  rayConstruction
  2.60      1.46     0.04        1     0.04     1.54  raytracing
  1.30      1.48     0.02  1241598     0.00     0.00  reflection
  1.30      1.50     0.02  1241598     0.00     0.00  refraction
  0.97      1.51     0.02   113297     0.00     0.00  fresnel
  0.65      1.52     0.01  2558386     0.00     0.00  idx_stack_empty
  0.65      1.53     0.01  2520791     0.00     0.00  idx_stack_top
  0.65      1.54     0.01  1241598     0.00     0.00  protect_color_overflow
  0.00      1.54     0.00  1204003     0.00     0.00  idx_stack_push
  0.00      1.54     0.00  1048576     0.00     0.00  idx_stack_init
  0.00      1.54     0.00    37595     0.00     0.00  idx_stack_pop
  0.00      1.54     0.00        3     0.00     0.00  append_rectangular
  0.00      1.54     0.00        3     0.00     0.00  append_sphere
  0.00      1.54     0.00        2     0.00     0.00  append_light
  0.00      1.54     0.00        1     0.00     0.00  calculateBasisVectors
  0.00      1.54     0.00        1     0.00     0.00  delete_light_list
  0.00      1.54     0.00        1     0.00     0.00  delete_rectangular_list
  0.00      1.54     0.00        1     0.00     0.00  delete_sphere_list
  0.00      1.54     0.00        1     0.00     0.00  diff_in_second
  0.00      1.54     0.00        1     0.00     0.00  write_to_ppm

3. perf stat

 Performance counter stats for './raytracing' (10 runs):

            53,970      cache-misses              #   19.476 % of all cache refs      ( +-  2.84% )
           277,108      cache-references                                              ( +-  6.25% )
    13,144,659,166      instructions              #    1.70  insns per cycle          ( +-  0.00% )
     7,750,237,141      cycles                                                        ( +-  0.74% )

       2.540274253 seconds time elapsed                                          ( +-  1.00% )

感謝hugikun999的提醒

第二次嘗試 OpenMP + inline - 9/28 1am

1. 發現問題

加了inline之後,又試著在math-toolkit.h裡用OpenMP,但時間依然很長(兩百多秒);後來想到,每個函式被叫用那麼多次,但是每個都是迴圈短、運算比較簡單的時候,就不需要每個簡單的運算都用一個thread去執行。

2. call graph

  • Loop unrolling + inline (尚未加上omp)

  • ray_color() & rayConstruction()

3. 嘗試修改 raytracing()

/* @param background_color this is not ambient light */
void raytracing(uint8_t *pixels, color background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights, const viewpoint *view,
                int width, int height)
{
    point3 u, v, w, d;
    color object_color = { 0.0, 0.0, 0.0 };

    /* calculate u, v, w */
    calculateBasisVectors(u, v, w, view);

    idx_stack stk;

    int factor = sqrt(SAMPLES);
    for (int j = 0; j < height; j++) {
        for (int i = 0; i < width; i++) {
            double r = 0, g = 0, b = 0;
            /* MSAA */
	    #pragma omp parallel for firstprivate(r,g,b)
            for (int s = 0; s < SAMPLES; s++) {
                idx_stack_init(&stk);
                rayConstruction(d, u, v, w,
                                i * factor + s / factor,
                                j * factor + s % factor,
                                view,
                                width * factor, height * factor);
                if (ray_color(view->vrp, 0.0, d, &stk, rectangulars, spheres,
                              lights, object_color,
                              MAX_REFLECTION_BOUNCES)) {
                    r += object_color[0];
                    g += object_color[1];
                    b += object_color[2];
                } else {
                    r += background_color[0];
                    g += background_color[1];
                    b += background_color[2];
                }
                pixels[((i + (j * width)) * 3) + 0] = r * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 1] = g * 255 / SAMPLES;
                pixels[((i + (j * width)) * 3) + 2] = b * 255 / SAMPLES;
            }
        }
    }
}

speed up but check fail

 Performance counter stats for './raytracing' (10 runs):

            67,856      cache-misses              #    0.188 % of all cache refs      ( +-  2.44% )
        36,086,782      cache-references                                              ( +-  2.19% )
    16,662,901,112      instructions              #    0.91  insns per cycle          ( +-  5.07% )
    18,386,293,743      cycles                                                        ( +-  4.35% )

       1.601089390 seconds time elapsed                                          ( +-  4.40% )

-> data dependent.

4. loop unrolling + inline + omp


Loop unrolling raytracing() - 9/28 3pm

1. Loop unrolling 第三層迴圈

  • 在嘗試解決加入OpenMP前後相依造成的錯誤當中,觀察到SAMPLE只有4,又是同時存取r,g,b三個值,故將他的迴圈解開。之後,分別兩個嘗試
  • 1)把OpenMP留著 -> 速度快但check fail
 Performance counter stats for './raytracing' (10 runs):

           207,759      cache-misses              #    0.276 % of all cache refs      ( +-  2.99% )
        75,389,334      cache-references                                              ( +-  3.65% )
    12,420,744,790      instructions              #    0.72  insns per cycle          ( +-  0.15% )
    17,284,430,642      cycles                                                        ( +-  0.58% )

       1.931539834 seconds time elapsed                                          ( +-  1.13% )

  • 2)移除OpenMP -> 資料正確
 Performance counter stats for './raytracing' (10 runs):

           248,194      cache-misses              #   35.564 % of all cache refs      ( +-  3.97% )
           697,873      cache-references                                              ( +-  9.18% )
    13,320,768,291      instructions              #    1.69  insns per cycle          ( +-  0.00% )
     7,859,385,254      cycles                                                        ( +-  0.47% )

       2.606176699 seconds time elapsed                                          ( +-  0.38% )

2. call graph 檢查變數是否可能被多個thread存取

Omp success - 9/28 7pm

1. 成功把raytracing用omp平行,check成功

在Unroll 第三層迴圈之下,找出dependent的參數d, stk, object_color

2. perf stat

 Performance counter stats for './raytracing' (10 runs):

            74,722      cache-misses              #    0.097 % of all cache refs      ( +-  3.33% )
        77,274,246      cache-references                                              ( +-  5.31% )
    12,810,788,231      instructions              #    0.76  insns per cycle          ( +-  0.11% )
    16,853,158,191      cycles                                                        ( +-  0.42% )

       2.026637683 seconds time elapsed                                          ( +-  0.80% )

3. call gpraph

4. 與沒有inline dot_product比較 perf stat

 Performance counter stats for './raytracing' (10 runs):

           102,166      cache-misses              #    0.059 % of all cache refs      ( +-  8.70% )
       173,186,250      cache-references                                              ( +-  5.89% )
        63,646,408      branch-misses                                                 ( +-  1.08% )
    15,806,263,440      instructions              #    0.50  insns per cycle          ( +-  0.26% )
    31,412,416,665      cycles                                                        ( +-  1.00% )

       3.722181320 seconds time elapsed                                          ( +-  1.97% )

num_threads()的嘗試 9/28 - 11pm

1. 測試分配不同數量的threads對效能的影響

  • 500
 Performance counter stats for './raytracing' (10 runs):

           586,040      cache-misses              #    1.565 % of all cache refs      ( +-  1.72% )
        37,444,312      cache-references                                              ( +-  0.24% )
    12,194,796,576      instructions              #    0.85  insns per cycle          ( +-  0.01% )
    14,430,824,154      cycles                                                        ( +-  0.12% )

       1.266002949 seconds time elapsed                                          ( +-  0.29% )
  • 450
 Performance counter stats for './raytracing' (10 runs):

           571,745      cache-misses              #    1.523 % of all cache refs      ( +-  1.76% )
        37,550,691      cache-references                                              ( +-  0.42% )
    12,194,710,719      instructions              #    0.84  insns per cycle          ( +-  0.01% )
    14,506,164,926      cycles                                                        ( +-  0.38% )

       1.275681361 seconds time elapsed                                          ( +-  0.54% )
  • 256
 Performance counter stats for './raytracing' (10 runs):

           400,710      cache-misses              #    1.071 % of all cache refs      ( +-  5.59% )
        37,425,937      cache-references                                              ( +-  0.36% )
    12,182,627,168      instructions              #    0.84  insns per cycle          ( +-  0.01% )
    14,428,624,839      cycles                                                        ( +-  0.21% )

       1.277496560 seconds time elapsed                                          ( +-  0.50% )

  • 12
 Performance counter stats for './raytracing' (10 runs):

            93,895      cache-misses              #    0.167 % of all cache refs      ( +- 11.93% )
        56,274,093      cache-references                                              ( +-  1.03% )
    12,322,845,231      instructions              #    0.74  insns per cycle          ( +-  0.02% )
    16,560,790,058      cycles                                                        ( +-  0.33% )

       1.497703779 seconds time elapsed                                          ( +-  0.28% )

POSIX Thread

software pipelining

分析與效能優化策略

觀察

1. math-toolkit.h

  • 根據gprof的結果,發現主要耗時的是math-toolkit.h的4個函式,因此將會針對這4個函式進行優化

    1. Loop unrolling dot_product() & subtract_vector() : 減少不必要的跳出迴圈判斷
  • 發現multiply_vector()的執行時間大於multiply_vectors()

  • 運算式多為vector的運算,單個函式呼叫中資料彼此之前沒有關係 -> 用 OpenMP 把迴圈的運算平行化

  • inline是建議compiler函式的屬性,內嵌,與編譯的方法相關

2. main.c

3. raytracing.c

  • 把所有raytracing()呼叫到的函式檢查一遍是否有共用存取

    1. idx_stack.h ->
      把idx_stack_init inline (not yet)

    2. ray_color(const point3 e, double t,
      const point3 d,
      idx_stack *stk,
      const rectangular_node rectangulars,
      const sphere_node spheres,
      const light_node lights,
      color object_color, int bounces_left)
      add_vector(object_color, refraction_part, object_color);

    3. rayConstruction(d, u, v, w, i * factor + 0 / factor, j * factor + 0 % factor, view, width * factor, height * factor);

    subtract_vector(s, view->vrp, d);
    normalize(d);

    1. ray_hit_object(const point3 e, const point3 d, double t0, double t1, const rectangular_node rectangulars, rectangular_node *hit_rectangular, const sphere_node spheres, sphere_node *hit_sphere)
      *hit_sphere = sphere;
      *hit_rectangular = NULL;
    2. reflection(r, d, op.normal);
    3. refraction(rr, d, ip.normal, idx, idx_pass);

函式的屬性 - Attributes of Functions

__attribute__()可以在宣告一個函式時,標明函式的屬性,用法是:
void f () __attribute__ (attributeName) >這個可能參考資料寫錯了,因為compile時有出現錯誤訊息:
attributes should be specified before the declarator in a function definition static

種類

  • always_inline : 標明函式是要用內嵌,也就是不使用呼叫、跳去記憶體位置執行函式,而是在編譯時將函式在呼叫點展開。 (inline:建議,compiler不一定會照做,擺法也不太一樣,inline位在宣告時的函式名稱最前面)
    參考資料1
    參考資料2

Gprof

print the execution time of each function in the process.

  1. -pg > this option represents for profiling when compiling. Besides, it must be a compilation option as well as a link option.

  2. -Q > switch to suppress the printing of the call graph data.

  3. Run the program to generate the data gprof needs.
    gprof options [exe] [profile-data-file] > outfile
    If you omit the executable file name, the file a.out is used. If you give no profile data file name, the file gmon.out is used.

(Not yet):To link the program for profiling, if you use a compiler such as cc to do the linking, simply specify `-pg' in addition to your usual options.

gprof2dot

SIMD (Single instruction, multiple data)

AVX (Advanced Vector Extensions)

將大部分的整數運算指令擴展到 256 bits

Register

  • 16 個 YMM registers.
  • 8 個 32-bit 的單浮點數或 4 個 64-bit 的雙浮點數
  • XMM0 – XMM7 to YMM0 – YMM7 (in x86-64 mode, YMM0 – YMM15)

參考資料

SSE (Streaming SIMD Extensions)

C programming

sqrt() calculates the square root.

hugikun999 : apt install python-pip -> pip install gprof2dot