
2016q3 Homework4 (raytracing)

contributed by <hugikun999>, <ponsheng>

1. Please add the assignment requirements.
2. Separate Chinese and English keywords with a space.
Course TA

prefetch

Reference: the shared notes of 王紹華 and 許耕福

When Prefetching Works, When It Doesn’t, and Why

  • Two reasons make prefetching difficult:
    (1) There are few rigorous guidelines on how best to insert prefetch intrinsics.
    (2) The complexity of the interaction between software and hardware prefetching is not well understood.

  • Seek answers to the following questions.
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?

  • The paper uses the stream prefetcher, GHB prefetcher, and content-based prefetcher as its hardware-based prefetching mechanisms.

  • prefetch classification

  • prefetch distance
    We define prefetch distance as the distance ahead of which a prefetch should be requested on a memory address.
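
As a minimal illustration (not taken from the paper), a software prefetch with an explicit distance could look like this, using GCC's __builtin_prefetch(); the array, the loop and the PREFETCH_DISTANCE value are made up for this sketch, and the best distance is machine-dependent:

#include <stddef.h>

#define PREFETCH_DISTANCE 16    /* illustrative value only */

double sum_array(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        /* request a[i + PREFETCH_DISTANCE] now so it is (hopefully) already
         * in cache by the time the loop reaches it */
        __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0, 3);
        sum += a[i];
    }
    return sum;
}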

  • Direct and Indirect Memory Indexing
    (1) Direct memory indexing is easy to prefetch in hardware.
    (2) Indirect memory indexing is easier to prefetch in software, but it has higher overhead than direct indexing (see the sketch below).
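
A hedged sketch of the difference (the arrays a and idx, the length n and the distance DIST are all hypothetical): a hardware prefetcher can follow the sequential walk over idx[], but the data-dependent access a[idx[i]] is hard for it to predict, whereas software can simply prefetch DIST iterations ahead. The extra cost is that the prefetch itself must first read idx[i + DIST]:

for (size_t i = 0; i < n; i++) {
    if (i + DIST < n)
        __builtin_prefetch(&a[idx[i + DIST]], 0, 1);  /* indirect: needs idx[i + DIST] first */
    sum += a[idx[i]];
}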

  • Benefits of Software Prefetching over Hardware Prefetching
    (1) Larger number of streams: software can cover more concurrent streams than the hardware prefetcher can track.
    (2) Short streams: hardware needs at least two cache misses to detect the direction of a stream.
    (3) Irregular memory access: easy to handle in software.
    (4) Cache locality hints: software has greater flexibility in placing data in the right cache level.
    (5) Loop bounds: loop unrolling, software pipelining, and branch instructions can be used to avoid prefetching past the end of an array.

  • Negative Impacts of Software Prefetching
    (1) Increased instruction count.
    (2) Static insertion: the compiler decides at compile time what to prefetch, so the decision cannot adapt to runtime behavior.
    (3) Code structure change: the prefetch needs enough lead time, so transformations such as loop splitting may be required (a small sketch follows this list).
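
A small sketch of the loop-splitting idea mentioned in (3) and in benefit (5) above (a, n, sum and DIST are hypothetical): the main loop prefetches DIST elements ahead and therefore stops DIST iterations early, and a short epilogue finishes the remaining iterations without issuing prefetches past the end of the array:

size_t i = 0;
for (; i + DIST < n; i++) {              /* main loop: safe to prefetch ahead */
    __builtin_prefetch(&a[i + DIST], 0, 3);
    sum += a[i];
}
for (; i < n; i++)                       /* epilogue: last DIST iterations, no prefetch */
    sum += a[i];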

  • Synergistic Effects when Using Software and Hardware Prefetching Together
    (1) Handling Multiple Streams
    (2) Positive Training: software prefetch requests can help train the hardware prefetcher.

  • Antagonistic Effects when Using Software and Hardware Prefetching Together
    (1) Negative Training: software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
    (2) Harmful Software Prefetching.

  • Static Distance vs. Machine Configuration
    (1) The static prefetch distance does not impact performance significantly even if the machine configuration changes at runtime.

  • Effect of Using Hardware and Software Prefetching Together
    Although there are some positive benefits, it is generally better not to train hardware prefetching with software prefetching requests.

  • Prefetch coverage
    Low coverage (the ratio of useful prefetches to total L2 misses) is the main reason for performance loss in the neutral and negative groups.

  • Problem
    (1)
    When using hints to specify which cache level the prefetched data should be placed in, would we expect T1 and T2 to work better when there is data reuse?

I don't understand why T1 and T2 would be better when data is reused. 俊逸

Original text:
Where we might expect the T1 and T2 hints to work well is in the case of streaming applications that have little data reuse.
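
For reference, the T0/T1/T2/NTA hints discussed here correspond to the locality argument of the x86 prefetch intrinsic; a minimal sketch (exactly which cache level each hint targets is partly implementation-defined):

#include <xmmintrin.h>

void prefetch_with_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);    /* temporal data: prefetch into all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);    /* prefetch into L2 and below, bypassing L1 */
    _mm_prefetch(p, _MM_HINT_T2);    /* prefetch into the last-level cache */
    _mm_prefetch(p, _MM_HINT_NTA);   /* non-temporal: minimize cache pollution */
}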

SIMD

(1)
Use AVX in add_vector(): load the a and b it receives into ymm0 and ymm1 and add them in one instruction; a mask is applied on the store so that only three of the four doubles are written back, since only three are used. Compared with the original version this was not faster, in fact slightly slower.

Execution time of raytracing() : 0.610198 sec
Verified OK
__m256i mask = _mm256_set_epi64x(0x0000000000000000, 0xffffffffffffffff,
                                 0xffffffffffffffff, 0xffffffffffffffff);
__m256d c = _mm256_loadu_pd(a);        /* load a[0..3] into a ymm register */
__m256d d = _mm256_loadu_pd(b);        /* load b[0..3] into a ymm register */
__m256d dst = _mm256_add_pd(c, d);     /* add four doubles at once */
_mm256_maskstore_pd(out, mask, dst);   /* write back only out[0..2]; the fourth lane is skipped */

(2)
Declare the variables with the register keyword, which asks the compiler to keep them in registers for the computation, closer to how hand-written assembly would work and potentially faster; however, it is still slower than the original version.

register __m256d ymm0, ymm1;
__m256i mask = _mm256_set_epi64x(0x0000000000000000, 0xffffffffffffffff,
                                 0xffffffffffffffff, 0xffffffffffffffff);
ymm0 = _mm256_loadu_pd(a);
ymm1 = _mm256_loadu_pd(b);
ymm0 = _mm256_add_pd(ymm0, ymm1);
_mm256_maskstore_pd(out, mask, ymm0);
Execution time of raytracing() : 0.567389 sec
Verified OK

(3)
Rewriting the other functions in math-toolkit.h with AVX made things worse. My guess is that they are not faster because the input arguments must first be loaded into ymm registers before any computation, and the ymm result must then be stored back to out; there is not enough arithmetic in between, so computing 4 doubles at once does not pay off. If the computation for each pixel could be combined and done inside the ymm registers in one pass, performance might improve.

Modified dot_product

Execution time of raytracing() : 0.746257 sec
Verified OK

Modified multiply_vector

Execution time of raytracing() : 0.661732 sec
Verified OK

Modified multiply_vectors

Execution time of raytracing() : 0.642653 sec
Verified OK

Modified subtract_vector

Execution time of raytracing() : 0.602767 sec
Verified OK

(4)
I wanted to handle the cross_product() and multiply_vectors() calls inside scalar_triple_product() with AVX as well, to check whether the guess in (3) is correct. Since cross_product() multiplies the components cross-wise, I tried to rearrange them with _mm256_shuffle_pd(); because of that intrinsic's limitations I planned to switch to _mm256_shuffle_ps() instead, which requires changing the type, but I was not sure whether moving data originally kept in __m256d into __m256 would break the subsequent computation.

That cannot be done: the AVX load/store intrinsics are typed, so you cannot reason about the data purely as raw bits. 俊逸
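
One workable alternative (not what was attempted above) is to stay with __m256d and use the AVX2 cross-lane permute _mm256_permute4x64_pd() instead of _mm256_shuffle_pd(). A hedged sketch of a possible AVX2 cross_product(), assuming the vectors are stored as double[3] and that AVX2 is available (CPUs with only AVX do not have this intrinsic):

#include <immintrin.h>

/* Sketch of out = a x b using the identity cross(a,b) = yzx(a*yzx(b) - yzx(a)*b).
 * Like the add_vector snippet above, the unaligned loads read one double past
 * the 3-element vectors; the maskstore writes back only three doubles. */
static inline void cross_product_avx2(const double *a, const double *b, double *out)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1);
    __m256d va = _mm256_loadu_pd(a);                                   /* [ax ay az ?] */
    __m256d vb = _mm256_loadu_pd(b);                                   /* [bx by bz ?] */
    __m256d va_yzx = _mm256_permute4x64_pd(va, _MM_SHUFFLE(3, 0, 2, 1));
    __m256d vb_yzx = _mm256_permute4x64_pd(vb, _MM_SHUFFLE(3, 0, 2, 1));
    __m256d t = _mm256_sub_pd(_mm256_mul_pd(va, vb_yzx),
                              _mm256_mul_pd(va_yzx, vb));              /* [cz cx cy ?] */
    t = _mm256_permute4x64_pd(t, _MM_SHUFFLE(3, 0, 2, 1));             /* [cx cy cz ?] */
    _mm256_maskstore_pd(out, mask, t);
}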

(5)
Continuing from (4), I finally rewrote scalar_triple_product() so that it no longer calls cross_product() and multiply_vectors(); the implementations of both functions were inlined into scalar_triple_product() and replaced entirely with AVX. The results confirm that the guess in (3) was correct.

Original version

 Performance counter stats for './raytracing' (25 runs):

           817,055      cache-misses                                                  ( +-  1.61% )
         3,585,015      branch-misses                                                 ( +-  0.76% )
    11,927,206,637      instructions                                                  ( +-  0.01% )

       1.116747000 seconds time elapsed                                          ( +-  0.14% )

AVX version

 Performance counter stats for './raytracing' (25 runs):

           849,830      cache-misses                                                  ( +-  2.09% )
         3,576,107      branch-misses                                                 ( +-  0.68% )
    11,928,956,505      instructions                                                  ( +-  0.01% )

       1.114582907 seconds time elapsed                                          ( +-  0.11% )

(6)
Continuing from (5), the two function calls inside scalar_triple() in math-toolkit.h were also converted to the AVX version. This time the improvement is larger: cache misses drop by about half, probably because all of the needed data is loaded into registers, so the following computation no longer touches other data and therefore stops generating cache misses.

 Performance counter stats for 'make check' (25 runs):

           417,793      cache-misses                                                  ( +-  2.75% )
         3,439,375      branch-misses                                                 ( +-  0.52% )
    11,929,304,004      instructions                                                  ( +-  0.00% )

       1.079552216 seconds time elapsed                                          ( +-  0.09% )

pthread

Reference: the shared notes of classmate 陳品睿

One strategy to reduce lock/atomic contention is to use thread-local variables. That article compares the performance of the traditional pthread implementation against the newer __thread storage class, which requires linker support. link
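
A minimal sketch of the two approaches being compared (the per-thread counter is made up for illustration):

#include <pthread.h>

/* (a) classic pthreads thread-specific data: every access goes through the
 *     pthread_getspecific()/pthread_setspecific() library calls */
static pthread_key_t counter_key;          /* created once with pthread_key_create() */

/* (b) __thread storage class: each thread gets its own copy and accesses it
 *     like an ordinary variable (needs compiler/linker TLS support) */
static __thread long counter;

void bump(void)
{
    counter++;    /* no lock and no library call on the fast path */
}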

Study C11 threading

We can see that the raytracing() function is responsible for filling in the RGB value of each pixel, so it is a good target for splitting across pthreads:

void raytracing(uint8_t *pixels, color background_color,
                rectangular_node rectangulars, sphere_node spheres,
                light_node lights, const viewpoint *view,
                int width, int height)

Define a new struct to pack the arguments passed to each thread:

typedef struct {
    uint8_t *pixels;
    color background_color;
    rectangular_node rectangulars;
    sphere_node spheres;
    light_node lights;
    viewpoint *view;
    int width;
    int height;
} arg_t;
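
A minimal sketch of launching the workers with this struct; THREADS, the my_id member and the raytracing_worker() function are hypothetical names added for this example (the real arg_t above would need one extra member to carry the thread id):

#include <pthread.h>

#define THREADS 4

void *raytracing_worker(void *ptr);    /* hypothetical: runs one of the loops shown below */

void run_threads(const arg_t *shared)
{
    pthread_t tid[THREADS];
    arg_t args[THREADS];

    for (int i = 0; i < THREADS; i++) {
        args[i] = *shared;             /* every thread sees the same scene description */
        args[i].my_id = i;             /* hypothetical member: selects the rows/columns to render */
        pthread_create(&tid[i], NULL, raytracing_worker, &args[i]);
    }
    for (int i = 0; i < THREADS; i++)
        pthread_join(tid[i], NULL);
}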

From my classmate's notes you can see that he tried several ways of splitting the image among threads.
No matter how the image is split, using threads we want to achieve:
1. an even distribution of work (time) across the threads;
2. as little extra overhead as possible.

Test environment:

  • Intel i3, 2 cores, 4 hardware threads

1. Each thread handles one row

for (int j = my_id; j < height; j += THREADS) {
    for (int i = 0; i < width; i++) {
        /* process */
    }
}

Execution time with different numbers of threads:

2. Each thread handles one column

for (int i = my_id; i < width; i += THREADS) {
    for (int j = 0; j < height; j++) {
        /* process */
    }
}

Execution time with different numbers of threads:

I originally wanted to look into cache misses caused by locality, but I found that raytracing does not load contiguous data; it only stores contiguous data, and a store goes into the cache first and is written back to the corresponding memory location later, so there is no cache-miss issue here.

Need to understand the store path in more detail.

3. fast_first

The thread that finishes its current row first grabs the next row to compute, which gives the most even balance of work across threads.
Uses an atomic operation:

int j = __sync_fetch_and_add(&comm_num, 1);
while (j < height) {
    for (int i = 0; i < width; i++) {
        /* process */
    }
    j = __sync_fetch_and_add(&comm_num, 1);
}

Comparison of the three methods (4 threads)

Adding OpenMP, compared with the original version (4 threads)
OpenMP likewise splits the work across threads to speed the program up, but the hand-written pthread version is still faster.
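
For reference, a hedged sketch of what the OpenMP variant could look like; only the pragma and the scheduling choice matter here, the loop body is the same per-pixel work as before, and dynamic scheduling plays a role similar to the fast_first strategy:

#include <omp.h>

omp_set_num_threads(4);
#pragma omp parallel for schedule(dynamic, 1)    /* hand out one row at a time */
for (int j = 0; j < height; j++) {
    for (int i = 0; i < width; i++) {
        /* process pixel (i, j) */
    }
}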

C11 thread

C11 was formerly known as C1X.

Detailed memory model to better support multiple threads of execution

Multi-threading support (_Thread_local storage-class specifier, <threads.h> header including thread creation/management functions, mutex, condition variable and thread-specific storage functionality, as well as the _Atomic type qualifier and <stdatomic.h> for uninterruptible object access)

C11 multi-thread Stack Overflow discussion

  • The memory model for concurrency in C11 was adopted from C++11; there are essentially only some syntactic differences.
    C11 ref

C++11 multi-thread Stack Overflow discussion
C++11 introduced a standardized memory model

Herb Sutter says here:

The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.

"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.

The goal is to let the same code run on any OS and any hardware (through the abstract machine).

  • C++98/C++03 only define single-threaded execution; multi-threaded behaviour is not defined at all (let alone mechanisms such as mutexes).
  • Each system had its own implementation (e.g. pthreads, Windows threads); C++11 is the first standard to unify them.
    The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory.

Writing the code (C11)

#include <threads.h>

thrd_t threadId;                                              /* thread id type */
int thrd_create(thrd_t *thr, thrd_start_t func, void *arg);   /* thread creation function */
int thrd_join(thrd_t thr, int *res);                          /* thread join function */
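
A minimal usage sketch of these functions (the worker function and its argument are made up); this is the kind of program the build attempt below tries to compile:

#include <stdio.h>
#include <threads.h>

int worker(void *arg)                   /* thrd_start_t is int (*)(void *) */
{
    printf("hello from thread, arg = %d\n", *(int *) arg);
    return 0;
}

int main(void)
{
    thrd_t tid;
    int arg = 42, res;

    thrd_create(&tid, worker, &arg);    /* create the thread */
    thrd_join(tid, &res);               /* wait for it and collect its return value */
    return res;
}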

Compilation problem

fatal error: threads.h: No such file or directory

GCC C11 status
-> Threading [Optional] | Library issue (not implemented)

__STDC_NO_THREADS__: the integer constant 1, intended to indicate that the implementation does not support the <threads.h> header.

  • __STDC_NO_THREADS__ prints 1 on this machine.
  • gcc and clang do not support C11 threading yet.
  • C++11 threads can be compiled with g++ instead.

pthread vs windows thread vs C11 thread ?
There are multiple parts of C++ 2011 multi-threading:

  • Higher-level abstractions like std::thread, std::mutex, std::condition_variable, etc. These abstractions are implemented in terms of pthreads for both libc++ (clang's native library) and libstdc++ (gcc's native library). libstdc++ uses an indirection (gthr.h) which can be used to, e.g., stub things out for a single-threaded implementation. This is quite obvious from the source of the different synchronization classes.
  • The lower-level synchronization facilities, i.e., atomics and the various memory visibility controls, are not available from pthreads. It seems both gcc and clang implement these using compiler built-ins which probably create suitable instructions. I haven't tracked down the actual code for either of these, however.
  • It isn't sufficient to implement things in the library: the compiler needs to be prevented from reordering instructions across synchronization primitives and it needs to make values visible in appropriate locations.

other

CUDA

CUDA QUICK START GUIDE

(1) Linux armhf:
Debian distinguishes three ARM floating-point conventions: soft, armel and armhf. The names describe how floating-point values are handled. With armel, floating-point values are first passed through integer registers and then transferred to the FPU; with armhf, they are passed directly to the FPU; with soft, no hardware floating-point computation is used at all.
Reference

(2) Linux aarch64:
ARMv8 has two execution states, AArch64 and AArch32, which are mutually exclusive instruction sets.
Reference

(3) Linux POWER8:
An architecture designed as a massively multithreaded chip; it performs well on both compute-intensive and data-intensive workloads.
Reference

Code Refactoring

(1)
raytracing() in raytracing.c declares point3 u, v, w, but these three are just values copied from view, normalize()d once, and never modified afterwards, so they can be moved to the top of the file and declared as static globals. rayConstruction() then no longer needs u, v, w passed in on every call, saving the parameter-passing overhead. The results show no substantial difference, possibly because the saving is too small relative to the total work.
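
A hedged sketch of the change (paraphrased, not copied from raytracing.c):

/* raytracing.c, file scope: u, v, w are written once (normalized values derived
 * from *view) at the start of raytracing() and are read-only afterwards, so
 * rayConstruction() can use them directly instead of receiving them as
 * arguments on every call. */
static point3 u, v, w;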

Before the change

 Performance counter stats for './raytracing' (25 runs):

           487,718      cache-misses                                                  ( +-  1.97% )
         3,468,131      branch-misses                                                 ( +-  0.54% )
    11,922,768,865      instructions                                                  ( +-  0.00% )

       1.086883184 seconds time elapsed                                          ( +-  0.12% )

After the change

 Performance counter stats for './raytracing' (25 runs):

           495,401      cache-misses                                                  ( +-  1.19% )
         3,492,891      branch-misses                                                 ( +-  0.96% )
    11,905,723,270      instructions                                                  ( +-  0.00% )

       1.085519004 seconds time elapsed                                          ( +-  0.09% )