
2016q3 Homework4 (raytracing)

contributed by <hugikun999>, <ponsheng>

1. Please add the assignment requirements.
2. Separate Chinese and English keywords with a space.
Course TA

prefetch

Reference: the shared notes of 王紹華 and 許耕福

When Prefetching Works, When It Doesn’t, and Why

  • Two reasons make prefetching difficult:
    (1) There are few rigorous guidelines on how best to insert prefetch intrinsics.
    (2) The complexity of the interaction between software and hardware prefetching is not well understood.

  • Seek answers to the following questions.
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?

  • The paper uses the stream prefetcher, GHB prefetcher, and content-based prefetcher as its hardware-based prefetching mechanisms.

  • prefetch classification

  • prefetch distance
    We define prefetch distance as the distance ahead of which a prefetch should be requested on a memory address.
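
As a minimal illustration (not taken from the paper), a software prefetch with an explicit distance could look like this, using GCC's __builtin_prefetch(); the array, the loop and the PREFETCH_DISTANCE value are made up for this sketch, and the best distance is machine-dependent:

#include <stddef.h>

#define PREFETCH_DISTANCE 16    /* illustrative value only */

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* request a[i + PREFETCH_DISTANCE] now so it is (hopefully) already
         * in cache by the time the loop reaches it */
        __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}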

  • Direct and Indirect Memory Indexing
    (1) Direct memory indexing is easy to prefetch in hardware.
    (2) Indirect memory indexing is easier to prefetch in software, but it has higher overhead than direct indexing (see the sketch below).
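
A hedged sketch of the difference (the arrays a and idx, the length n and the distance DIST are all hypothetical): a hardware prefetcher can follow the sequential walk over idx[], but the data-dependent access a[idx[i]] is hard for it to predict, whereas software can simply prefetch DIST iterations ahead. The extra cost is that the prefetch itself must first read idx[i + DIST]:

for (size_t i = 0; i < n; i++) {
    if (i + DIST < n)
        __builtin_prefetch(&a[idx[i + DIST]], 0, 1);  /* indirect: needs idx[i + DIST] first */
    sum += a[idx[i]];
}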

  • Benefits of Software Prefetching over Hardware Prefetching
    (1) Larger number of streams: software can cover more concurrent streams than the hardware prefetcher can track.
    (2) Short streams: hardware needs at least two cache misses to detect the direction of a stream.
    (3) Irregular memory access: easy to handle in software.
    (4) Cache locality hints: software has greater flexibility in placing data in the right cache level.
    (5) Loop bounds: loop unrolling, software pipelining, and branch instructions can be used to avoid prefetching past the end of an array.

  • Negative Impacts of Software Prefetching
    (1) Increased instruction count.
    (2) Static insertion: the compiler decides at compile time what to prefetch, so the decision cannot adapt to runtime behavior.
    (3) Code structure change: the prefetch needs enough lead time, so transformations such as loop splitting may be required (a small sketch follows this list).
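
A small sketch of the loop-splitting idea mentioned in (3) and in benefit (5) above (a, n, sum and DIST are hypothetical): the main loop prefetches DIST elements ahead and therefore stops DIST iterations early, and a short epilogue finishes the remaining iterations without issuing prefetches past the end of the array:

size_t i = 0;
for (; i + DIST < n; i++) {              /* main loop: safe to prefetch ahead */
    __builtin_prefetch(&a[i + DIST], 0, 3);
    sum += a[i];
}
for (; i < n; i++)                       /* epilogue: last DIST iterations, no prefetch */
    sum += a[i];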

  • Synergistic Effects when Using Software and Hardware Prefetching Together
    (1) Handling Multiple Streams
    (2) Positive Training: software prefetch requests can help train the hardware prefetcher.

  • Antagonistic Effects when Using Software and Hardware Prefetching Together
    (1) Negative Training: software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
    (2) Harmful Software Prefetching.

  • Static Distance vs. Machine Configuration
    (1) The static prefetch distance does not impact performance significantly even if the machine configuration changes at runtime.

  • Effect of Using Hardware and Software Prefetching Together
    Although there are some positive benefits, it is generally better not to train hardware prefetching with software prefetching requests.

  • Prefetch coverage
    Low coverage (the ratio of useful prefetches to total L2 misses) is the main reason for performance loss in the neutral and negative groups.

  • Problem
    (1)
    When using hints to specify which cache level the prefetched data should be placed in, would we expect T1 and T2 to work better when there is data reuse?

I don't understand why T1 and T2 would be better when data is reused. 俊逸

Original text:
Where we might expect the T1 and T2 hints to work well is in the case of streaming applications that have little data reuse.
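
For reference, the T0/T1/T2/NTA hints discussed here correspond to the locality argument of the x86 prefetch intrinsic; a minimal sketch (exactly which cache level each hint targets is partly implementation-defined):

#include <xmmintrin.h>

void prefetch_with_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);    /* temporal data: prefetch into all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);    /* prefetch into L2 and below, bypassing L1 */
    _mm_prefetch(p, _MM_HINT_T2);    /* prefetch into the last-level cache */
    _mm_prefetch(p, _MM_HINT_NTA);   /* non-temporal: minimize cache pollution */
}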

SIMD

(1)
Use AVX in add_vector(): load the a and b it receives into ymm0 and ymm1 and add them in one instruction; a mask is applied on the store so that only three of the four doubles are written back, since only three are used. Compared with the original version this was not faster, in fact slightly slower.

Execution time of raytracing() : 0.610198 sec
Verified OK
__m256i mask = _mm256_set_epi64x(0x0000000000000000, 0xffffffffffffffff,
                                 0xffffffffffffffff, 0xffffffffffffffff);
__m256d c = _mm256_loadu_pd(a);        /* load a[0..3] into a ymm register */
__m256d d = _mm256_loadu_pd(b);        /* load b[0..3] into a ymm register */
__m256d dst = _mm256_add_pd(c, d);     /* add four doubles at once */
_mm256_maskstore_pd(out, mask, dst);   /* write back only out[0..2]; the fourth lane is skipped */

(2)
Declare the variables with the register keyword, which asks the compiler to keep them in registers for the computation, closer to how hand-written assembly would work and potentially faster; however, it is still slower than the original version.

register __m256d ymm0, ymm1;
__m256i mask = _mm256_set_epi64x(0x0000000000000000, 0xffffffffffffffff,
                                 0xffffffffffffffff, 0xffffffffffffffff);
ymm0 = _mm256_loadu_pd(a);
ymm1 = _mm256_loadu_pd(b);
ymm0 = _mm256_add_pd(ymm0, ymm1);
_mm256_maskstore_pd(out, mask, ymm0);
Execution time of raytracing() : 0.567389 sec
Verified OK

(3)
Rewriting the other functions in math-toolkit.h with AVX made things worse. My guess is that they are not faster because the input arguments must first be loaded into ymm registers before any computation, and the ymm result must then be stored back to out; there is not enough arithmetic in between, so computing 4 doubles at once does not pay off. If the computation for each pixel could be combined and done inside the ymm registers in one pass, performance might improve.

Modified dot_product

Execution time of raytracing() : 0.746257 sec
Verified OK

Modified multiply_vector

Execution time of raytracing() : 0.661732 sec
Verified OK

Modified multiply_vectors

Execution time of raytracing() : 0.642653 sec
Verified OK

Modified subtract_vector

Execution time of raytracing() : 0.602767 sec
Verified OK

(4)
I wanted to handle the cross_product() and multiply_vectors() calls inside scalar_triple_product() with AVX as well, to check whether the guess in (3) is correct. Since cross_product() multiplies the components cross-wise, I tried to rearrange them with _mm256_shuffle_pd(); because of that intrinsic's limitations I planned to switch to _mm256_shuffle_ps() instead, which requires changing the type, but I was not sure whether moving data originally kept in __m256d into __m256 would break the subsequent computation.

That cannot be done: the AVX load/store intrinsics are typed, so you cannot reason about the data purely as raw bits. 俊逸
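
One workable alternative (not what was attempted above) is to stay with __m256d and use the AVX2 cross-lane permute _mm256_permute4x64_pd() instead of _mm256_shuffle_pd(). A hedged sketch of a possible AVX2 cross_product(), assuming the vectors are stored as double[3] and that AVX2 is available (CPUs with only AVX do not have this intrinsic):

#include <immintrin.h>

/* Sketch of out = a x b using the identity cross(a,b) = yzx(a*yzx(b) - yzx(a)*b).
 * Like the add_vector snippet above, the unaligned loads read one double past
 * the 3-element vectors; the maskstore writes back only three doubles. */
static inline void cross_product_avx2(const double *a, const double *b, double *out)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    __m256d va = _mm256_loadu_pd(a);                                   /* [ax ay az ?] */
    __m256d vb = _mm256_loadu_pd(b);                                   /* [bx by bz ?] */
    __m256d va_yzx = _mm256_permute4x64_pd(va, _MM_SHUFFLE(3, 0, 2, 1));
    __m256d vb_yzx = _mm256_permute4x64_pd(vb, _MM_SHUFFLE(3, 0, 2, 1));
    __m256d t = _mm256_sub_pd(_mm256_mul_pd(va, vb_yzx),
                              _mm256_mul_pd(va_yzx, vb));              /* [cz cx cy ?] */
    t = _mm256_permute4x64_pd(t, _MM_SHUFFLE(3, 0, 2, 1));             /* [cx cy cz ?] */
    _mm256_maskstore_pd(out, mask, t);
}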

(5)
Continuing from (4), I finally rewrote scalar_triple_product() so that it no longer calls cross_product() and multiply_vectors(); the implementations of both functions were inlined into scalar_triple_product() and replaced entirely with AVX. The results confirm that the guess in (3) was correct.

Original version

 Performance counter stats for './raytracing' (25 runs):

           817,055      cache-misses                                                  ( +-  1.61% )
         3,585,015      branch-misses                                                 ( +-  0.76% )
    11,927,206,637      instructions                                                  ( +-  0.01% )

       1.116747000 seconds time elapsed                                          ( +-  0.14% )

AVX version

 Performance counter stats for './raytracing' (25 runs):

           849,830      cache-misses                                                  ( +-  2.09% )
         3,576,107      branch-misses                                                 ( +-  0.68% )
    11,928,956,505      instructions                                                  ( +-  0.01% )

       1.114582907 seconds time elapsed                                          ( +-  0.11% )

(6)
Continuing from (5), the two function calls inside scalar_triple() in math-toolkit.h were also converted to the AVX version. This time the improvement is larger: cache misses drop by about half, probably because all of the needed data is loaded into registers, so the following computation no longer touches other data and therefore stops generating cache misses.

 Performance counter stats for 'make check' (25 runs):

           417,793      cache-misses                                                  ( +-  2.75% )
         3,439,375      branch-misses                                                 ( +-  0.52% )
    11,929,304,004      instructions                                                  ( +-  0.00% )

       1.079552216 seconds time elapsed                                          ( +-  0.09% )

pthread

Reference: the shared notes of classmate 陳品睿

One strategy to reduce lock/atomic contention is to use thread-local variables. That article compares the performance of the traditional pthread implementation against the newer __thread storage class, which requires linker support. link
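
A minimal sketch of the two approaches being compared (the per-thread counter is made up for illustration):

#include <pthread.h>

/* (a) classic pthreads thread-specific data: every access goes through the
 *     pthread_getspecific()/pthread_setspecific() library calls */
static pthread_key_t counter_key;          /* created once with pthread_key_create() */

/* (b) __thread storage class: each thread gets its own copy and accesses it
 *     like an ordinary variable (needs compiler/linker TLS support) */
static __thread long counter;

void bump(void)
{
    counter++;    /* no lock and no library call on the fast path */
}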

Study C11 threading

We can see that the raytracing() function is responsible for filling in the RGB value of each pixel, so it is a good target for splitting across pthreads:

void raytracing(uint8_t *pixels, color background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights, const viewpoint *view,
                int width, int height)

Define a new struct to pack the arguments passed to each thread:

typedef struct {
    uint8_t *pixels;
    color background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    viewpoint *view;
    int width;
    int height;
} arg_t;
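
A minimal sketch of launching the workers with this struct; THREADS, the my_id member and the raytracing_worker() function are hypothetical names added for this example (the real arg_t above would need one extra member to carry the thread id):

#include <pthread.h>

#define THREADS 4

void *raytracing_worker(void *ptr);    /* hypothetical: runs one of the loops shown below */

void run_threads(const arg_t *shared)
{
    pthread_t tid[THREADS];
    arg_t args[THREADS];

    for (int i = 0; i < THREADS; i++) {
        args[i] = *shared;             /* every thread sees the same scene description */
        args[i].my_id = i;             /* hypothetical member: selects the rows/columns to render */
        pthread_create(&tid[i], NULL, raytracing_worker, &args[i]);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
}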

From my classmate's notes you can see that he tried several ways of splitting the image among threads.
No matter how the image is split, using threads we want to achieve:
1. an even distribution of work (time) across the threads;
2. as little extra overhead as possible.

Test environment:

  • Intel i3, 2 cores, 4 hardware threads

1. Each thread handles one row

for (int j = my_id; j < height; j += THREADS) {
    for (int i = 0; i < width; i++) {
        /* process */
    }
}

Execution time with different numbers of threads:

2. Each thread handles one column

for (int i = my_id; i < width; i += THREADS) {
    for (int j = 0; j < height; j++) {
        /* process */
    }
}

Execution time with different numbers of threads:

I originally wanted to look into cache misses caused by locality, but I found that raytracing does not load contiguous data; it only stores contiguous data, and a store goes into the cache first and is written back to the corresponding memory location later, so there is no cache-miss issue here.

Need to understand the store path in more detail.

3. fast_first

The thread that finishes its current row first grabs the next row to compute, which gives the most even balance of work across threads.
Uses an atomic operation:

int j = __sync_fetch_and_add(&comm_num, 1);
while (j < height) {
    for (int i = 0; i < width; i++) {
        /* process */
    }
    j = __sync_fetch_and_add(&comm_num, 1);
}

Comparison of the three methods (4 threads)

Adding OpenMP, compared with the original version (4 threads)
OpenMP likewise splits the work across threads to speed the program up, but the hand-written pthread version is still faster.
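
For reference, a hedged sketch of what the OpenMP variant could look like; only the pragma and the scheduling choice matter here, the loop body is the same per-pixel work as before, and dynamic scheduling plays a role similar to the fast_first strategy:

#include <omp.h>

omp_set_num_threads(4);
#pragma omp parallel for schedule(dynamic, 1)    /* hand out one row at a time */
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        /* process pixel (i, j) */
    }
}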

C11 thread

C11 was formerly known as C1X.

Detailed memory model to better support multiple threads of execution

Multi-threading support (_Thread_local storage-class specifier, <threads.h> header including thread creation/management functions, mutex, condition variable and thread-specific storage functionality, as well as the _Atomic type qualifier and <stdatomic.h> for uninterruptible object access)

C11 multi-thread Stack Overflow discussion

  • The memory model for concurrency in C11 was adopted from C++11; there are essentially only some syntactic differences.
    C11 ref

C++11 multi-thread Stack Overflow discussion
C++11 introduced a standardized memory model

Herb Sutter says here:

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

The goal is to let the same code run on any OS and any hardware (through the abstract machine).

  • C++98/C++03 only define single-threaded execution; multi-threaded behaviour is not defined at all (let alone mechanisms such as mutexes).
  • Each system had its own implementation (e.g. pthreads, Windows threads); C++11 is the first standard to unify them.
    The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Writing the code (C11)

#include <threads.h>

thrd_t threadId;                                              /* thread id type */
int thrd_create(thrd_t *thr, thrd_start_t func, void *arg);   /* thread creation function */
int thrd_join(thrd_t thr, int *res);                          /* thread join function */
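
A minimal usage sketch of these functions (the worker function and its argument are made up); this is the kind of program the build attempt below tries to compile:

#include <stdio.h>
#include <threads.h>

int worker(void *arg)                   /* thrd_start_t is int (*)(void *) */
{
    printf("hello from thread, arg = %d\n", *(int *) arg);
    return 0;
}

int main(void)
{
    thrd_t tid;
    int arg = 42, res;

    thrd_create(&tid, worker, &arg);    /* create the thread */
    thrd_join(tid, &res);               /* wait for it and collect its return value */
    return res;
}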

Compilation problem

fatal error: threads.h: No such file or directory

GCC C11 status
-> Threading [Optional] | Library issue (not implemented)

__STDC_NO_THREADS__: the integer constant 1, intended to indicate that the implementation does not support the <threads.h> header.

  • __STDC_NO_THREADS__ prints 1 on this machine.
  • gcc and clang do not support C11 threading yet.
  • C++11 threads can be compiled with g++ instead.

pthread vs windows thread vs C11 thread ?
There are multiple parts of C++ 2011 multi-threading:

  • Higher-level abstractions like std::thread, std::mutex, std::condition_variable, etc. These abstractions are implemented in terms of pthreads for both libc++ (clang's native library) and libstdc++ (gcc's native library). libstdc++ uses an indirection (gthr.h) which can be used to, e.g., stub things out for a single-threaded implementation. This is quite obvious from the source of the different synchronization classes.
  • The lower-level synchronization facilities, i.e., atomics and the various memory visibility controls, are not available from pthreads. It seems both gcc and clang implement these using compiler built-ins which probably create suitable instructions. I haven't tracked down the actual code for either of these, however.
  • It isn't sufficient to implement things in the library: the compiler needs to be prevented from reordering instructions across synchronization primitives and it needs to make values visible in appropriate locations.

other

CUDA

CUDA QUICK START GUIDE

(1) Linux armhf:
Debian distinguishes three ARM floating-point conventions: soft, armel and armhf. The names describe how floating-point values are handled. With armel, floating-point values are first passed through integer registers and then transferred to the FPU; with armhf, they are passed directly to the FPU; with soft, no hardware floating-point computation is used at all.
Reference

(2) Linux aarch64:
ARMv8 has two execution states, AArch64 and AArch32, which are mutually exclusive instruction sets.
Reference

(3) Linux POWER8:
An architecture designed as a massively multithreaded chip; it performs well on both compute-intensive and data-intensive workloads.
Reference

Code Refactoring

(1)
raytracing() in raytracing.c declares point3 u, v, w, but these three are just values copied from view, normalize()d once, and never modified afterwards, so they can be moved to the top of the file and declared as static globals. rayConstruction() then no longer needs u, v, w passed in on every call, saving the parameter-passing overhead. The results show no substantial difference, possibly because the saving is too small relative to the total work.
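
A hedged sketch of the change (paraphrased, not copied from raytracing.c):

/* raytracing.c, file scope: u, v, w are written once (normalized values derived
 * from *view) at the start of raytracing() and are read-only afterwards, so
 * rayConstruction() can use them directly instead of receiving them as
 * arguments on every call. */
static point3 u, v, w;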

Before the change

 Performance counter stats for './raytracing' (25 runs):

           487,718      cache-misses                                                  ( +-  1.97% )
         3,468,131      branch-misses                                                 ( +-  0.54% )
    11,922,768,865      instructions                                                  ( +-  0.00% )

       1.086883184 seconds time elapsed                                          ( +-  0.12% )

After the change

 Performance counter stats for './raytracing' (25 runs):

           495,401      cache-misses                                                  ( +-  1.19% )
         3,492,891      branch-misses                                                 ( +-  0.96% )
    11,905,723,270      instructions                                                  ( +-  0.00% )

       1.085519004 seconds time elapsed                                          ( +-  0.09% )