contributed by <hugikun999>, <ponsheng>
1. Please add the assignment requirements.
2. Separate Chinese and English keywords with a space.
-- Course TA
Reference: the shared notes of 王紹華 and 許耕福
Two reasons make prefetching difficult:
(1)There are few rigorous guidelines on how best to insert prefetch intrinsics.
(2)The complexity of the interaction between software and hardware prefetching is not well understood.
Seek answers to the following questions.
(1) What are the limitations and overheads of software prefetching?
(2) What are the limitations and overheads of hardware prefetching?
(3) When is it beneficial to use software and/or hardware prefetching?
Use the stream prefetcher, GHB prefetcher, and content-based prefetcher as hardware-based prefetching mechanisms.
Prefetch classification
Prefetch distance
We define prefetch distance as the distance ahead of which a prefetch should be requested on a memory address.
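As a concrete illustration of the definition (a hypothetical loop, not one of the paper's benchmarks), a software prefetch issued `DIST` iterations ahead gives the memory system time to bring the line in before it is used:

```c
#include <stddef.h>

/* Hypothetical example: prefetch DIST elements ahead of the current
 * access.  DIST is the "prefetch distance": it must be large enough to
 * cover the memory latency, but small enough that the prefetched line
 * is not evicted before it is used. */
#define DIST 16

double sum_with_prefetch(const double *a, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + DIST < n)
            __builtin_prefetch(&a[i + DIST], 0, 3); /* read, high locality */
        sum += a[i];
    }
    return sum;
}
```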
Direct and Indirect Memory Indexing
(1) Direct memory indexing is easy to prefetch in hardware.
(2) Indirect memory indexing is easier to prefetch in software; it incurs higher overhead than direct indexing.
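A minimal sketch of the two indexing patterns (hypothetical arrays and names): a hardware stride prefetcher handles the direct case well, while in the indirect case software must compute the prefetch address from the index array first, which is the extra overhead mentioned above:

```c
#include <stddef.h>

/* Direct indexing: the address stream a, a+1, a+2, ... has a fixed
 * stride, so a hardware stream prefetcher can predict it. */
double sum_direct(const double *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Indirect indexing: the address of a[idx[i]] depends on the contents
 * of idx, so hardware cannot predict it.  Software can prefetch it by
 * reading the index a few iterations ahead; the extra instructions are
 * the higher overhead mentioned above. */
double sum_indirect(const double *a, const int *idx, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; i++) {
        if (i + 8 < n)
            __builtin_prefetch(&a[idx[i + 8]], 0, 1);
        s += a[idx[i]];
    }
    return s;
}
```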
Benefits of Software Prefetching over Hardware Prefetching
(1) Large number of streams: software can cover more concurrent streams than hardware resources allow.
(2) Short streams: hardware needs at least two cache misses to detect the direction of a stream.
(3) Irregular memory access: easy for software to handle.
(4) Cache locality hint: software has greater flexibility in placing data in the right cache level.
(5) Loop bounds: loop unrolling, software pipelining, and branch instructions can be used to avoid prefetching beyond the array bounds.
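The cache locality hint in (4) is exposed in GCC/Clang through the third argument of `__builtin_prefetch`; the mapping to the T0/T1/T2/NTA hints discussed later is approximate and target-dependent, so treat the comments below as a rough guide:

```c
/* __builtin_prefetch(addr, rw, locality): locality 3 roughly
 * corresponds to T0 (keep in all cache levels), 2 to T1, 1 to T2, and
 * 0 to NTA (non-temporal).  The exact instruction emitted depends on
 * the target architecture. */
void prefetch_hint_examples(const double *p)
{
    __builtin_prefetch(p, 0, 3); /* read, keep in all cache levels */
    __builtin_prefetch(p, 0, 0); /* read, non-temporal: minimize cache pollution */
    __builtin_prefetch(p, 1, 3); /* prefetch in anticipation of a write */
}
```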
Negative Impacts of Software Prefetching
(1) Increased instruction count.
(2) Static insertion: the compiler decides at compile time what to prefetch, so the decision cannot adapt to runtime behavior.
(3) Code structure change: prefetches need enough lead time, so transformations such as loop splitting are required.
Synergistic Effects when Using Software and Hardware Prefetching Together
(1) Handling Multiple Streams
(2) Positive Training: Software can help train hardware prefetcher.
Antagonistic Effects when Using Software and Hardware Prefetching Together
(1) Negative training: software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
(2) Harmful software prefetching.
Static Distance vs. Machine Configuration
(1) The static prefetch distance does not impact performance significantly even if the machine configuration is changed at runtime.
Effect of Using Hardware and Software Prefetching Together
Although there are some positive benefits, it is generally better not to train the hardware prefetcher with software prefetch requests.
Prefetch coverage
Low coverage (the ratio of useful prefetches to L2 misses) is the main reason for performance loss in the neutral and negative groups.
Problem
(1)
When using hints to specify which cache level the data should be stored into, would we expect T1 and T2 to work better when the data is reused?
I don't understand why T1 and T2 would be better when the data is reused. -- 俊逸
Original text:
Where we might expect the T1 and T2 hints to work well is in the case of streaming applications that have little data reuse.
(1)
Using AVX, the a and b received by add_vector are loaded into ymm0 and ymm1 for the computation; a mask is used to set one of the doubles to 0, since only three doubles are actually used. Compared with the original version, it was not faster and even slightly slower.
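A sketch of this idea (the function name and signature are assumptions based on the description above; requires a CPU with AVX): a mask restricts the load/store to the first three of the four 64-bit lanes, so the unused lane stays zero and memory past the 3-element arrays is never touched:

```c
#include <immintrin.h>

/* Hypothetical AVX add_vector: a, b, and out are 3-element double
 * vectors.  The mask activates only lanes 0..2, so the fourth lane is
 * never loaded or stored. */
__attribute__((target("avx")))
void add_vector_avx(const double *a, const double *b, double *out)
{
    const __m256i mask = _mm256_set_epi64x(0, -1, -1, -1); /* lanes 0..2 active */
    __m256d va = _mm256_maskload_pd(a, mask);  /* "ymm0" in the text above */
    __m256d vb = _mm256_maskload_pd(b, mask);  /* "ymm1" in the text above */
    _mm256_maskstore_pd(out, mask, _mm256_add_pd(va, vb));
}
```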
(2)
Declaring variables with register tells the compiler to keep them in registers for the computation, similar to working at the assembly level, which can speed things up; however, it was still slower than the original version.
(3)
Changing the other functions in math-toolkit.h all had a negative effect. The guess is that there is no speedup because the input arguments must first be loaded into ymm registers before the computation, and afterwards the ymm contents must be stored back to out; without a large amount of computation, the benefit of processing 4 doubles at once cannot show. If the computation for each pixel could be batched into ymm and done in one pass, performance might improve.
Modified dot_product
Modified multiply_vector
Modified multiply_vectors
Modified subtract_vector
(4)
The plan was to handle both the cross_product() and multiply_vectors() inside scalar_triple_product() with AVX, to check whether the guess from (3) is correct. Since cross_product() multiplies elements crosswise, _mm256_shuffle_pd() was used; because of that intrinsic's limitation, the plan was to switch to _mm256_shuffle_ps(), which requires changing the type, but it was unclear whether moving data originally placed in __m256d into __m256 would break the later computations.
This cannot be done: the AVX load intrinsics have fixed element types, so you cannot reason about them purely in terms of bits. -- 俊逸
(5)
Continuing from (4), scalar_triple_product() was finally changed to not call cross_product() and multiply_vectors() at all; instead, the implementations of both functions were rewritten with AVX directly inside scalar_triple_product(). The results show that the guess in (3) is correct.
Original version
AVX version
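One way around the lane-crossing limitation of _mm256_shuffle_pd noted in (4) is the AVX2 intrinsic _mm256_permute4x64_pd, which can move doubles across the 128-bit lanes. A sketch of a cross product built on it (function name and 4-element padding are assumptions; requires AVX2):

```c
#include <immintrin.h>

/* Hypothetical AVX2 cross product: a, b, out hold 3 doubles each,
 * padded to 4 so full-width loads/stores are safe.
 * cross = (a1*b2 - a2*b1, a2*b0 - a0*b2, a0*b1 - a1*b0) */
__attribute__((target("avx2")))
void cross_product_avx(const double a[4], const double b[4], double out[4])
{
    __m256d va = _mm256_loadu_pd(a);
    __m256d vb = _mm256_loadu_pd(b);
    /* lane rotations: (x0,x1,x2,x3) -> (x1,x2,x0,x3) and (x2,x0,x1,x3);
     * these cross the 128-bit lane boundary, which _mm256_shuffle_pd cannot do */
    __m256d a_yzx = _mm256_permute4x64_pd(va, _MM_SHUFFLE(3, 0, 2, 1));
    __m256d b_zxy = _mm256_permute4x64_pd(vb, _MM_SHUFFLE(3, 1, 0, 2));
    __m256d a_zxy = _mm256_permute4x64_pd(va, _MM_SHUFFLE(3, 1, 0, 2));
    __m256d b_yzx = _mm256_permute4x64_pd(vb, _MM_SHUFFLE(3, 0, 2, 1));
    _mm256_storeu_pd(out, _mm256_sub_pd(_mm256_mul_pd(a_yzx, b_zxy),
                                        _mm256_mul_pd(a_zxy, b_yzx)));
}
```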
(6)
Continuing from (5), the two function calls in scalar_triple() in math-toolkit.h were also changed to AVX versions. This optimization had a larger effect: cache misses dropped by half, possibly because all the needed data is kept in registers, so the subsequent computation touches no other data and avoids cache misses.
Reference: classmate 陳品睿's shared notes
To reduce lock/atomic contention, one strategy is to use thread-local variables. This article compares the performance of the traditional pthread implementation with the newer __thread, which requires linker support. link
Studying C11 threading
We can see that the raytracing function fills in the RGB value of each pixel, so this function can be partitioned across pthreads.
Define a new structure
From a classmate's shared notes, we can see different ways of partitioning the image.
Whatever partitioning is used, threading should achieve:
1. Even time distribution across the threads
2. Low extra overhead
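A minimal sketch of row-band partitioning with pthreads (the struct and names are assumptions; the real code would call the per-pixel ray tracing instead of the stand-in below). Each thread writes a disjoint band of rows, so no locking is needed:

```c
#include <pthread.h>

#define WIDTH  64
#define HEIGHT 64
#define NTHREADS 4

/* hypothetical per-thread task: a half-open row range plus the shared
 * output buffer; bands are disjoint, so threads never race */
typedef struct {
    int row_start, row_end;
    unsigned char *pixels;
} task_t;

static void *render_band(void *arg)
{
    task_t *t = arg;
    for (int y = t->row_start; y < t->row_end; y++)
        for (int x = 0; x < WIDTH; x++)
            /* stand-in for the real per-pixel ray tracing */
            t->pixels[y * WIDTH + x] = (unsigned char)((x + y) & 0xff);
    return NULL;
}

int render(unsigned char *pixels)
{
    pthread_t tid[NTHREADS];
    task_t task[NTHREADS];
    int band = HEIGHT / NTHREADS;

    for (int i = 0; i < NTHREADS; i++) {
        task[i].row_start = i * band;
        task[i].row_end = (i == NTHREADS - 1) ? HEIGHT : (i + 1) * band;
        task[i].pixels = pixels;
        if (pthread_create(&tid[i], NULL, render_band, &task[i]) != 0)
            return -1;
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
    return 0;
}
```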
Execution time with different numbers of threads
Originally I wanted to examine the cache misses caused by locality, but it turns out raytracing does not load contiguous data; it only stores contiguous data. The stores go first to some place in the cache and are later written back to the corresponding memory locations, so there is no cache miss issue here.
Need to understand the store process in more detail.
The thread that finishes its work first takes the next strip to compute, which achieves the ideal of all threads computing for an equal amount of time.
Use Atomic Operation
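The first-finished-takes-next strategy above can be sketched with an atomic row counter: each thread claims the next unrendered row with an atomic fetch-and-add, so fast threads naturally take more rows and no mutex is needed (names and the stand-in pixel computation are assumptions):

```c
#include <pthread.h>
#include <stdatomic.h>

#define WIDTH  64
#define HEIGHT 64
#define NTHREADS 4

static atomic_int next_row;                     /* shared work counter */
static unsigned char image[HEIGHT][WIDTH];

/* each thread repeatedly claims the next row; atomic_fetch_add
 * guarantees every row is rendered exactly once */
static void *dynamic_worker(void *arg)
{
    (void)arg;
    int y;
    while ((y = atomic_fetch_add(&next_row, 1)) < HEIGHT)
        for (int x = 0; x < WIDTH; x++)
            image[y][x] = (unsigned char)((x + y) & 0xff); /* stand-in for ray tracing */
    return NULL;
}

void render_dynamic(void)
{
    pthread_t tid[NTHREADS];
    atomic_store(&next_row, 0);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, dynamic_worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(tid[i], NULL);
}
```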
Comparison of the three methods (4 threads)
Comparison with OpenMP added vs. the original (4 threads)
OpenMP likewise splits the work across threads to speed up the program, but hand-written pthread code still achieves better speed.
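For comparison, the OpenMP version amounts to a single pragma over the row loop (a sketch; compile with -fopenmp, otherwise the pragma is ignored and the loop simply runs serially). schedule(dynamic) approximates the atomic-counter strategy described earlier:

```c
#define OMP_WIDTH  64
#define OMP_HEIGHT 64

/* the pragma splits the row loop across threads; each row is a unit
 * of work handed out dynamically, like the atomic counter version */
void render_omp(unsigned char *pixels)
{
    #pragma omp parallel for schedule(dynamic)
    for (int y = 0; y < OMP_HEIGHT; y++)
        for (int x = 0; x < OMP_WIDTH; x++)
            pixels[y * OMP_WIDTH + x] = (unsigned char)((x + y) & 0xff);
}
```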
C11, also known as C1X
Detailed memory model to better support multiple threads of execution
Multi-threading support (_Thread_local storage-class specifier, <threads.h> header including thread creation/management functions, mutex, condition variable and thread-specific storage functionality, as well as the _Atomic type qualifier and <stdatomic.h> for uninterruptible object access)
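A small sketch of the _Atomic qualifier and <stdatomic.h> from the list above (atomics are a separate optional C11 feature, so they work even where <threads.h> is missing; pthreads are used here only to create the contending threads):

```c
#include <pthread.h>
#include <stdatomic.h>

/* _Atomic makes the increment uninterruptible: two threads doing
 * 100000 increments each always end at exactly 200000, no mutex */
static _Atomic int counter;

static void *bump(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++)
        atomic_fetch_add(&counter, 1);
    return NULL;
}

int run_counter(void)
{
    pthread_t t1, t2;
    atomic_store(&counter, 0);
    pthread_create(&t1, NULL, bump, NULL);
    pthread_create(&t2, NULL, bump, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return atomic_load(&counter);
}
```

With a plain `int` instead of `_Atomic int`, the two read-modify-write sequences can interleave and the final count comes up short.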
C11 multi-thread Stack Overflow discussion
C++11 multi-thread Stack Overflow discussion
C++11
introduced a standardized memory model
Herb Sutter says here:
The memory model means that C++ code now has a standardized library to call regardless of who made the compiler and on what platform it's running. There's a standard way to control how different threads talk to the processor's memory.
"When you are talking about splitting [code] across different cores that's in the standard, we are talking about the memory model. We are going to optimize it without breaking the following assumptions people are going to make in the code," Sutter said.
The goal is to let code run on any OS and any hardware (via an abstract machine).
The abstract machine in C++11 is multi-threaded by design. It also has a well-defined memory model; that is, it says what the compiler may and may not do when it comes to accessing memory
fatal error: threads.h: No such file or directory
GCC C11 status
-> Threading [Optional] | Library issue (not implemented)
__STDC_NO_THREADS__: The integer constant 1, intended to indicate that the implementation does not support the <threads.h> header.
pthread vs windows thread vs C11 thread ?
There are multiple parts of C++ 2011 multi-threading:
Higher-level abstractions like std::thread, std::mutex, std::condition_variable, etc. These abstractions are implemented in terms of pthreads for both libc++ (clang's native library) and libstdc++ (gcc's native library). libstdc++ uses an indirection (gthr.h) which can be used to, e.g., stub things out for a single threaded implementation. This is quite obvious from the source of the different synchronization classes.
The lower-level synchronization facilities, i.e., atomics and the various memory visibility controls, are not available from pthreads. It seems both gcc and clang implement these using compiler build-ins which probably create suitable instructions. I haven't tracked down the actual code for either of these, however.
It isn't sufficient to implement things in the library: the compiler needs to be prevented from reordering instruction across synchronization primitives and it needs to make values visible in appropriate locations.
(1) Linux armhf:
Debian provides three ARM floating-point conventions: soft, armel, and armhf. The three names describe how floating-point computation is handled. For armel, floating-point values are passed through integer registers before being transferred to the FPU. For armhf, floating-point values are passed directly to the FPU. For soft, no hardware floating-point computation is used at all.
Reference site
(2) Linux aarch64:
The execution state of ARMv8 is divided into AArch64 and AArch32. The two are mutually exclusive ISAs.
Reference site
(3) Linux POWER8:
This architecture is designed as a massively multithreaded chip; it performs well on compute-intensive and data-intensive workloads.
Reference site
(1)
raytracing() in raytracing.c declares point3 u, v, w, but these three are copies of the view values that get normalize()d once, recorded, and never changed afterward. They can therefore be moved to the top of the file and declared as global static variables, so that calling rayConstruction() no longer requires passing u, v, w and wasting time on argument passing. The results show no real difference, possibly because the baseline cost is too small.
Before the change
After the change