# 2017q3 Homework2 (software-pipelining)
contributed by <`tina0405`>

## Paper Reading
Paper: [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

### 1. Introduction
* State of prefetching:
    * Although a variety of software and hardware prefetching mechanisms for tolerating cache miss latency exist, relatively simple software prefetching algorithms have appeared in state-of-the-art compilers like icc and gcc.
* Two open problems with inserting prefetch intrinsics:
    * There are few rigorous guidelines on how best to insert prefetch intrinsics.
    * The complexity of the interaction between software and hardware prefetching is not well understood.

---
Terminology:
* Cache prefetching, per [wiki](https://en.wikipedia.org/wiki/Cache_prefetching#Stream_buffers): fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed.
* Intrinsics — see this [SSE introduction](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_3.html):
~~~
Intrinsics look like ordinary functions, but the compiler translates them
directly into machine instructions. For example, a type declaration:

    __m128 data1, data2;   // 128-bit

The basic form of an SSE intrinsic is:

    _mm_<opcode>_<suffix>

<opcode> is the operation class, e.g. add, sub, etc.
<suffix> is the data type.

SSE (Streaming SIMD Extensions): a SIMD instruction set introduced by
Intel in 1999 together with the Pentium III processor.

SIMD (single instruction, multiple data): one instruction performs the
same operation on multiple data elements at once.
~~~
* See also [SSE (Streaming SIMD Extensions)](http://www.songho.ca/misc/sse/sse.html) for a first reading of what the paper means by "insert prefetch intrinsics" (a minimal C sketch appears at the end of this section):
~~~
Prefetch instructions provide cache hints to fetch data to the L1 and/or
L2 cache before the program actually needs the data:

prefetcht0:  move the data from memory to L1 and L2 caches using the t0 hint.
prefetcht1:  move the data from memory to L2 cache using the t1 hint.
prefetchnta: move non-temporal aligned data from memory to L1 cache directly (bypassing L2).
~~~
---
* Main research directions of the paper:
    * provide guidelines for inserting prefetch intrinsics in the presence of a hardware prefetcher;
    * automated cooperative software/hardware prefetching schemes.
![](https://i.imgur.com/RP7RgKU.png)
    * Baseline binaries have only compiler-inserted software prefetches (no hardware prefetching).
    * positive: software prefetching improves performance by more than 5%.
    * negative: software prefetching degrades performance by more than 5%.
* Summary of the software and hardware prefetch configurations:
    * GHB and STR are the hardware prefetchers; the study compares SW, HW, and SW+HW configurations.
    * STR (Stream Buffers), per [hardware prefetching](https://en.wikipedia.org/wiki/Cache_prefetching#Stream_buffers):
    ![](https://i.imgur.com/942zI96.png)
        * The mechanism: suppose there are 4 buffered blocks; when block A misses, the consecutive blocks A+1, A+2, A+3, A+4 are brought up. This is called sequential prefetching and is often used for instruction prefetch (spatial locality).
    * GHB (Global History Buffer), per [Data Cache Prefetching Using a Global History Buffer](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf):
        * an alternative structure for holding prefetcher history
        * the address history is held in a FIFO table
        * the GHB keeps a table of all global miss addresses
        * history information is maintained in linked lists
        * lookups go through a hash table
    ![](https://i.imgur.com/aqcjZXw.png)
* Much prior work studies software and hardware prefetching separately, but little focuses on their interaction. The paper therefore asks:
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?
* Positive impact — software prefetching targets:
    * short array streams
    * irregular memory address patterns
    * L1 cache miss reduction
* Negative effects:
    * software prefetching can interfere with the training of the hardware prefetcher.
* "Software prefetching" here means prefetch intrinsics used on top of gcc or icc compiler-inserted prefetching.
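To make "inserting prefetch intrinsics" concrete before diving into the paper, here is a minimal C sketch (my own toy example, not from the paper; `DIST` is a made-up distance):

~~~c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define DIST 16  /* prefetch distance in elements; an assumed value */

/* Sum an array while prefetching DIST elements ahead of the current load. */
float sum_with_prefetch(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
~~~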
### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
* Common data structures:
![](https://i.imgur.com/R2dJN2I.png)
    * A cache line is the smallest unit moved between memory and the cache.
    * The Hardware pref. column suggests possible hardware prefetching mechanisms suitable for each data structure.
    * The Software pref. column lists possible software prefetching mechanisms.
    * Hashes are much harder to prefetch effectively.
    * The paper refers to unit-stride cache-line accesses as *streams*, and to access stride distances greater than two cache lines as *strided*.

Terminology, per [wiki](https://en.wikipedia.org/wiki/Stride_of_an_array):
unit stride — an array whose stride is exactly the size of each of its elements is contiguous in memory.

* Software Prefetch Intrinsics
    * part of the x86 SSE SIMD extension
![](https://i.imgur.com/H2wnSSb.png)
Note: Temporal — data will be used again soon.
    * Instruction format: `_mm_prefetch(char *ptr, int hint)`
    * One intrinsic roughly expands to 2 instructions (a directly computed address) or 4 instructions (an indirectly computed memory address), per the paper's example:
~~~
LEA &a[i], R0      // calculate addr(a[i])
PREFETCH R0        // prefetch a[i]
(a) Direct Indexing Pattern (a[i])

LEA &b[i], R0      // calculate addr(b[i])
MOV R1, R0         // LD MEM[R1]: access b[i]
LEA &a[0]+R1, R2   // calculate addr(a[b[i]])
PREFETCH R2        // prefetch a[b[i]]
(b) Indirect Indexing Pattern (a[b[i]])

Direct:   the target (e.g. A[i]) is obtained directly.
Indirect: the target requires two or more dependent loads (e.g. B[A[i]]).
~~~
* Prefetches are classified into six categories:
![](https://i.imgur.com/UENcFhI.png)
* Illustrated on a timeline:
![](https://i.imgur.com/dkRCWRc.png)
* Software Prefetch Distance:
![](https://i.imgur.com/5PudKdd.png)
    * Prefetch requests must be sent early enough to fully hide memory latency.
    * This value D is called the prefetch distance: the distance ahead of which a prefetch should be requested on a memory address, D ≥ ⌈l / s⌉, where
        * l: prefetch latency
        * s: length of the shortest path through the loop body
    * Since l and s both vary at runtime, D is chosen large enough to hide the latency.
    * If D is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched at all — less coverage and more cache misses.
* Direct and Indirect Memory Indexing
    * Direct memory indexing:
        * easy to prefetch in hardware, since the memory addresses show regular stream/stride behavior.
    * Indirect memory indexing:
        * requires special hardware prefetching mechanisms, but is relatively simple to compute in software — the address calculation is easy there;
        * has a higher overhead than direct memory access: two loads are required, so two levels of prefetch streams are needed (the indirect assembly pattern is shown above; a C sketch follows).
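The same two-level idea in C, as a sketch (hypothetical arrays `a` and `b`; prefetching the index array 2×D ahead is my own heuristic, not the paper's): `b[i + D]` must itself be resident before `&a[b[i + D]]` can even be computed.

~~~c
#include <xmmintrin.h>

/* D is the prefetch distance; its choice is discussed above and in Sec. 4. */
void indirect_prefetch_sketch(float *out, const float *a, const int *b,
                              int n, int D)
{
    for (int i = 0; i < n; i++) {
        /* level 1: prefetch the index array twice as far ahead ... */
        if (i + 2 * D < n)
            _mm_prefetch((const char *)&b[i + 2 * D], _MM_HINT_T0);
        /* ... level 2: so b[i + D] is cached by the time it is used here */
        if (i + D < n)
            _mm_prefetch((const char *)&a[b[i + D]], _MM_HINT_T0);
        out[i] = a[b[i]];  /* the demand access */
    }
}
~~~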
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
* Where software prefetching beats hardware prefetching:
    * Large Number of Streams
        * Hardware is limited by resources such as stream detectors and book-keeping mechanisms.
        * Scientific computing workloads with multiple streams can exceed the hardware's capacity.
        * Example: lbm, a stencil-based computation on a 3D grid whose loop references all 27 nearest neighbors of each grid point
          ( b[i][j][k] = (a[i][j][k] + a[i][j][k+1] + a[i][j][k−1] + a[i][j−1][k] + a[i][j+1][k] + ...) / 27 ).
        * Software prefetching can insert prefetch requests independently for each stream in lbm.
    * Short Streams
        * Hardware prefetchers require training time to detect the direction and distance of a stream or stride.
        * Even aggressive hardware prefetchers need at least two cache misses to detect the direction of a stream.
        * If a stream is too short, there will not be enough cache misses to train a hardware prefetcher and load useful cache blocks.
        * Example: milc exhibits short-stream behavior. It operates on many 3x3 matrices with 16 bytes per element, so each matrix totals 144 bytes, or just three cache blocks — too short to train the hardware prefetchers.
    * Irregular Memory Access
        * Software only needs to insert appropriate prefetch intrinsics; doing the same in hardware is far more complex.
        * RDS (recursive data structure) prefetching suits mcf.
    * Cache Locality Hint
        * Hardware prefetching places the data in the lower-level cache of the requesting core.
            * Benefit: inserting prefetched blocks into the lower level (L2 or L3) greatly reduces higher-level (L1) cache pollution.
            * Drawback: the L2-to-L1 latency can degrade performance significantly.
        * Software-prefetched data is placed directly into the L1 cache.
            * Greater flexibility: data can be placed in the right cache level.
    * Loop Bounds
        * Loop bounds of array data structures are known in software, so software can avoid prefetching past them.
        * Techniques that prevent generating out-of-bounds prefetch requests: loop unrolling, software pipelining, and branch instructions.
---
Note: *aggressive* means a large prefetch distance and a high prefetch degree.
Overly aggressive prefetching results in early or incorrect prefetch requests, in addition to consuming memory bandwidth.
---
* Negative impacts of software prefetching:
    * Increased Instruction Count
        * Prefetch instructions consume fetch/execution bandwidth and machine resources.
        * In the figure, bwaves sees its instruction count grow by 100% because of software prefetching.
    ![](https://i.imgur.com/b3bXlYJ.png)
    * Static Insertion
        * Memory latency, effective cache size, and bandwidth cannot be adapted at runtime; these keep changing and cannot be anticipated, especially on heterogeneous architectures.
        * In hardware, prefetch distance and prefetching degree are usually fixed as well, but feedback-driven adaptive hardware prefetching mechanisms exist; adaptive software mechanisms are largely nonexistent.
    * Code Structure Change
        * When a loop body contains few instructions, inserting software prefetch instructions that issue in time is challenging: there are not enough computations between the prefetch instructions and the demand requests.
        * Remedy: code structure changes, such as loop splitting, are required (see the sketch after this list).
        * Consequences: more instructions, and software prefetching becomes harder.
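A sketch of one common form of loop splitting (my own example, not the paper's): splitting the loop removes the per-iteration bounds check that guarding the prefetch would otherwise require, so the main loop stays tight.

~~~c
#include <xmmintrin.h>

#define D 16  /* assumed prefetch distance, in elements */

void scale_split(float *a, int n, float k)
{
    int i = 0;
    /* main loop: i + D is always in bounds, so no per-iteration check */
    for (; i + D < n; i++) {
        _mm_prefetch((const char *)&a[i + D], _MM_HINT_T0);
        a[i] *= k;
    }
    /* split-off tail: the last D iterations, with no prefetching */
    for (; i < n; i++)
        a[i] *= k;
}
~~~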
* Benefits of combining software + hardware prefetching:
    * Handling Multiple Streams
        * Together they cover more data streams.
        * Example: a hardware prefetcher might cover regular streams while software prefetching covers irregular streams.
    * Positive Training
        * Software prefetch requests can actually help train the hardware prefetcher.
        * Example: if a block prefetched by software arrives late, a trained hardware prefetcher can improve prefetch timeliness.
* Drawbacks of combining software + hardware prefetching:
    * Negative Training
        * If blocks prefetched by software hide part of one or more streams, the hardware prefetcher will not be trained properly.
        * Software prefetch instructions can also trigger overly aggressive hardware prefetches, which results in early requests.
    * Harmful Software Prefetching
        * Hardware prefetchers generally have lower accuracy than software prefetchers.
        * When software prefetches are inaccurate, they increase stress on the cache and memory bandwidth, which usually reduces the effectiveness of hardware prefetching.

### 4. EXPERIMENTAL METHODOLOGY
* The goal of the experiments is to quantify these interactions:
    * find promising software prefetch candidates;
    * insert additional software prefetch intrinsics;
    * to make the choice of prefetch distance systematic, compute it with the formula below (a worked example with made-up profile numbers follows the prefetch-distance discussion in the next section).

##### Prefetch Intrinsic Insertion Algorithm
![](https://i.imgur.com/BRI7wnf.png)
* A prefetch candidate: L1 cache misses per 1000 instructions (MPKI) > 0.05
* Symbols:
    * K is a constant
    * L = average memory latency
    * IPCbench = profiled average IPC of each benchmark
    * Wloop = average instruction count of one loop iteration
* The paper uses K = 4 and L = 300.
* IPCbench and Wloop come from profiling data, so the distance choice is effectively profile-driven.

### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
#### Overhead and Limitations of Software Prefetching
* Instruction overhead: the main negative impact of software prefetching.
* Every LANG/TYPE of instruction costs some cycles, shown here as IPC:
![](https://i.imgur.com/0ZZutWy.png)
* This part ties back to the Section 3.2 discussion; the relevant figure again:
![](https://i.imgur.com/b3bXlYJ.png)
    * For GemsFDTD, bwaves, and leslie3d, the total number of instructions increases by more than 50% (more than 100% for bwaves).
    * All of the above add over 50% more instructions to get the prefetching effect, which ought to be a heavy burden.
    * Remarkably, GemsFDTD stays positive despite the >50% instruction increase: the benefit of its prefetches outweighs the instruction overhead.
![](https://i.imgur.com/GX2WbxF.png)
* To measure the overhead of software prefetching, the authors eliminate, one factor at a time: cache pollution (SW+P), bandwidth consumption (SW+B, between core and memory), memory access latency (SW+L), redundant prefetch overhead (SW+R), and instruction overhead (SW+I).
    * The question is which removed overhead lowers the cycle count the most, controlling one variable at a time.
    * Cache pollution would be caused by early or incorrect prefetches, but it has little impact in their experiments.
    * SW+B matters least: current machines provide enough bandwidth for single-thread applications.
    * SW+L: software prefetching does not completely hide memory latency.
    * SW+R: notably, even with a large number of redundant prefetches, the negative impact is usually negligible.

#### The Effect of Prefetch Distance
![](https://i.imgur.com/mowwixn.png)
* The x-axis of Figure 6 is the prefetch distance relative to the base prefetch distance, in units of a cache block.
* The base prefetch distance is calculated using Equation (1):
![](https://i.imgur.com/5PudKdd.png)
    * This value D is called the prefetch distance: the distance ahead of which a prefetch should be requested on a memory address.
    * l: prefetch latency
    * s: length of the shortest path through the loop body
* D must not be too small — it has to hide the latency — but too large is also bad: by the time the data is needed it may already have been evicted from the cache.
* Still, Table VII shows that most benchmarks are insensitive to the distance.
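As a sanity check of the Section 4 distance formula, with made-up profile numbers (the paper's K = 4 and L = 300 cycles, plus an assumed IPCbench of 1.5 and Wloop of 120 instructions):

$$
D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil
  = \left\lceil \frac{4 \times 300 \times 1.5}{120} \right\rceil
  = 15
$$

so prefetches would be issued 15 cache blocks ahead of the demand access.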
#### Cache-Level Insertion Policy
* An ordinary out-of-order processor can tolerate L1 misses, so prefetched blocks are usually inserted into a lower cache level to avoid polluting L1; but when the miss rate exceeds what the processor can tolerate, performance drops.
* The software advantage: prefetched blocks can be inserted directly, precisely, and safely into the L1 cache, without the pollution risk.
* The experiments use the T0 cache-level placement hint (see the hint sketch at the end of this section):
![](https://i.imgur.com/BcQJzP1.png)
    * In the experiments T0 is usually better than T1 or T2.
    * Note that T1 and T2 behave the same (the simulated machine has a two-level cache).
    * Why T0 beats T1 (T2): it hides L1 cache misses by inserting prefetched blocks into the L1 cache.
![](https://i.imgur.com/QQv3nwR.png)
    * This figure shows software prefetching reducing the L1 cache miss rate from 10.4% to 0.01%.

#### Effect of Using Hardware and Software Prefetching Together
* To observe the software/hardware interaction, two hardware-prefetcher training policies are distinguished:
    * NT (SW+GHB, SW+STR)
        * The hardware prefetcher's training algorithm ignores software prefetch requests;
        * the hardware prefetcher is trained only by demand cache misses.
    * Train (SW+GHB+T, SW+STR+T)
        * The hardware prefetcher's training algorithm includes software prefetch requests in addition to the usual demand misses.
![](https://i.imgur.com/5rq7nR7.png)
    * The sphinx3 benchmark shows a small benefit.
    * milc, gcc, and soplex exhibit negative effects.
    * A 3–5% gain comes from training the hardware prefetcher early (fewer late prefetches),
    * but the negative impact can be large: 55.9% for libquantum and 22.5% for milc.
    * Why milc suffers: short streams — software prefetching can trigger unnecessary hardware prefetches.
    * Why libquantum suffers:
        (1) L1 miss penalties: software prefetches usually insert into L1 and L2, but hardware prefetches trained by software typically insert into L2 only, not L1;
        (2) overly aggressive prefetches: software and hardware have independent prefetch distances, so when software trains hardware, the hardware prefetch requests can become too aggressive (blocks that should be fetched at the software distance get fetched at the hardware distance instead), causing many early prefetches.
    * Conclusion: avoid training the hardware prefetcher with software prefetch requests.

#### Prefetch Coverage
![](https://i.imgur.com/7GSHNjg.png)
* Figure 10(a) shows the prefetch coverage, i.e. the ratio of useful prefetches to total L2 misses.
* Figure 10(b) shows the prefetch coverage of SW+GHB and SW+STR.
* Where the coverage of software prefetching is smaller than that of hardware, performance usually falls into the neutral and negative groups.
* Conclusion: lower coverage is why the neutral and negative groups fail.
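For reference, the four hints map onto `_mm_prefetch` like this (a sketch only; the actual placement policy is microarchitecture-dependent, as the T1/T2 equivalence above shows):

~~~c
#include <xmmintrin.h>

/* One prefetch per cache-level hint, for a hypothetical address p. */
void prefetch_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* temporal: all cache levels, incl. L1  */
    _mm_prefetch(p, _MM_HINT_T1);   /* into L2 and below, not L1             */
    _mm_prefetch(p, _MM_HINT_T2);   /* into L3 and below (same as T1 on a
                                       two-level cache)                      */
    _mm_prefetch(p, _MM_HINT_NTA);  /* non-temporal: minimize cache pollution */
}
~~~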
## Development Environment
~~~
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Stepping:              9
CPU MHz:               1199.865
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              5188.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
~~~

## Code Analysis
Starting from [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx):
* Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and the prefetcher's impact on the cache.
* Measured results of the original experiment:
~~~
tina@tina-X550VB:~/Hw/prefetcher$ ./main
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
sse prefetch: 60579 us
sse: 115357 us
naive: 245752 us
~~~
* First, `naive_transpose`: a plain transpose that moves one element at a time.
~~~c=
void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
~~~
---
* `sse_transpose` improves on `naive_transpose`:
~~~c=
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
~~~
* About `_mm_unpacklo_epi32`, per [Microsoft Specific](https://msdn.microsoft.com/zh-tw/library/x8atst9d(v=vs.100).aspx):
~~~
__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b);

Interleaves the lower 2 signed or unsigned 32-bit integers in a with
the lower 2 signed or unsigned 32-bit integers in b.

Return value:
r0 := a0 ; r1 := b0
r2 := a1 ; r3 := b1
~~~
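A quick self-contained check of the unpack semantics (my own test program, assuming SSE2); the expected output follows the `r0..r3` pseudocode above:

~~~c
#include <stdio.h>
#include <emmintrin.h>  /* SSE2: _mm_unpack{lo,hi}_epi32 */

int main(void)
{
    __m128i a = _mm_setr_epi32(0, 1, 2, 3);  /* a0..a3 */
    __m128i b = _mm_setr_epi32(4, 5, 6, 7);  /* b0..b3 */
    int r[4];

    _mm_storeu_si128((__m128i *)r, _mm_unpacklo_epi32(a, b));
    printf("unpacklo: %d %d %d %d\n", r[0], r[1], r[2], r[3]);  /* 0 4 1 5 */

    _mm_storeu_si128((__m128i *)r, _mm_unpackhi_epi32(a, b));
    printf("unpackhi: %d %d %d %d\n", r[0], r[1], r[2], r[3]);  /* 2 6 3 7 */
    return 0;
}
~~~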
* The resulting transpose looks like this, per [4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html):
![](https://i.imgur.com/bO2IYYv.png)
* One instruction operating on several data elements at once is the original meaning of SIMD (Single Instruction, Multiple Data).
* Two guesses about why it is faster:
    (1) SSE operates on multiple data per instruction, so it takes less time;
    (2) the number of loop iterations shrinks — the unrolling described in the paper.
* Test: unroll the naive version without SSE (controlling for that one variable). Note the parentheses in `(y + k) * w + x`; without them, `y + k * w + x` would read the wrong elements.
~~~c=
void unrolling_naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y += 4) {
            *(dst + x * h + y + 0) = *(src + (y + 0) * w + x);
            *(dst + x * h + y + 1) = *(src + (y + 1) * w + x);
            *(dst + x * h + y + 2) = *(src + (y + 2) * w + x);
            *(dst + x * h + y + 3) = *(src + (y + 3) * w + x);
        }
}
~~~
* Finding: merely reducing the loop count already helps a lot. But unrolling trades space for time, so larger matrices still need testing — a test harness that averages over many runs is needed before drawing further conclusions.
~~~
sse prefetch:    62782 us
sse:            118891 us
naive:          254991 us
unrolling naive: 41136 us
~~~
> The SSE version that actually manipulates the data turns out slower here; the time it costs deserves discussion.
* Analyzing the four versions from the cache-miss angle:
    * The four variants were built into 4 separate executables via `#if defined()` to make measuring cache misses convenient.
    * Observation: even though `unrolling_naive` is the fastest of the four, its cache-miss *rate* is no better than naive's (94.5% vs. 93.9% of all cache refs); only its absolute miss count drops, since it issues far fewer cache references.
    * `sse_prefetch`, which reads data in ahead of time, lowers the cache-miss rate by only about 6 percentage points relative to naive, but compared with the other versions, prefetching is the better way to reduce cache misses.
~~~
Performance counter stats for './sse' (100 runs):

      6,623,840      cache-misses       # 88.219 % of all cache refs  ( +- 0.14% )
      7,508,365      cache-references                                 ( +- 0.03% )
  1,236,536,992      instructions       # 1.40 insn per cycle         ( +- 0.00% )
    885,797,861      cycles                                           ( +- 0.23% )

    0.290275488 seconds time elapsed                                  ( +- 0.27% )

Performance counter stats for './sse_prefetch' (100 runs):

      6,601,353      cache-misses       # 87.664 % of all cache refs  ( +- 0.05% )
      7,530,286      cache-references                                 ( +- 0.02% )
  1,282,611,347      instructions       # 1.77 insn per cycle         ( +- 0.00% )
    725,189,318      cycles                                           ( +- 0.07% )

    0.234799490 seconds time elapsed                                  ( +- 0.17% )

Performance counter stats for './naive' (100 runs):

     18,857,380      cache-misses       # 93.858 % of all cache refs  ( +- 0.02% )
     20,091,354      cache-references                                 ( +- 0.01% )
  1,448,528,038      instructions       # 1.14 insn per cycle         ( +- 0.00% )
  1,273,844,641      cycles                                           ( +- 0.18% )

    0.417337455 seconds time elapsed                                  ( +- 0.10% )

Performance counter stats for './unrolling_naive' (100 runs):

      2,172,771      cache-misses       # 94.526 % of all cache refs  ( +- 0.03% )
      2,298,595      cache-references                                 ( +- 0.07% )
  1,448,269,218      instructions       # 2.15 insn per cycle         ( +- 0.00% )
    672,497,285      cycles                                           ( +- 0.14% )

    0.217644893 seconds time elapsed                                  ( +- 0.19% )
~~~
---
* `sse_prefetch_transpose`: prefetch before operating on the data; the fastest of the original three versions.
~~~c=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
#define PFDIST 8
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
~~~
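The `PFDIST 8` above is hard-coded; per the paper's prefetch-distance discussion it is worth sweeping. A small sketch (my own build-time knobs, not part of the original repo) that makes the distance and the hint tunable from the compiler command line, e.g. `gcc -DPFDIST=16 -DPF_HINT=_MM_HINT_T0 ...`:

~~~c
#include <xmmintrin.h>

#ifndef PFDIST
#define PFDIST 8             /* default: the value used above */
#endif
#ifndef PF_HINT
#define PF_HINT _MM_HINT_T1  /* default: the hint used above */
#endif

/* Prefetch the next four rows of the current 4x4 tile. */
static inline void prefetch_tile(const int *src, int x, int y, int w)
{
    for (int r = 0; r < 4; r++)
        _mm_prefetch((const char *)(src + (y + PFDIST + r) * w + x), PF_HINT);
}
~~~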
* Comparing all four versions side by side: the x-axis is the run index (100 runs) and the y-axis is time.
![](https://i.imgur.com/CQBdtXm.png)

### Measuring T0, T1, T2, NTA from the Cache-Level Insertion Policy
![](https://i.imgur.com/H2wnSSb.png)
* Per [Software Prefetching](https://lwn.net/Articles/255364/):
    * `_MM_HINT_T0` fetches data into all cache levels; if the data item is in a higher-level cache, it is loaded into L1d.
    * `_MM_HINT_T1` pulls the data into L2 and not into L1d.
    * If there is an L3 cache, `_MM_HINT_T2` can do something similar for it.
    * `_MM_HINT_NTA`
        * stands for non-temporal aligned;
        * avoids polluting the cache as much as possible, for data that will be used soon but only once;
        * when the data is evicted from L1d, it does not go into L2/L3 but is written straight back to memory.
    * The reference's advice matches the paper's results: in general, `_MM_HINT_T0` is fine for small amounts of data used right away, provided L1d is large enough to hold all the prefetched data; if the immediately-used working set is too large, falling back on T1 and T2 is better.
![](https://i.imgur.com/y3vWZrW.png)
* See also [all-about-cpu-cache](http://cenalulu.github.io/linux/all-about-cpu-cache/).
* With 4096 entries, T0 gives the best time, but the jitter in the measurements, and how to combine T1 and T2, still need further study.
* The jitter indicates an uneven distribution across runs; here the least uniform hint is NTA and the most uniform is T1.
![](https://i.imgur.com/oErDM0Q.png)
* 2000 entries — 884,152 cache misses:
~~~
Performance counter stats for './sse_prefetch' (100 runs):

        884,152      cache-misses
      2,470,661      L1-dcache-loads-misses  # 2.23% of all L1-dcache hits  ( +- 0.59% )  (79.42%)
    111,017,071      L1-dcache-loads                                        ( +- 0.24% )  (72.31%)
        744,840      L1-dcache-prefetch-misses                              ( +- 1.88% )  (40.05%)

    0.058283823 seconds time elapsed                                        ( +- 0.46% )
~~~
* 1000 entries — 355,783 cache misses:
~~~
Performance counter stats for './sse_prefetch' (100 runs):

        355,783      cache-misses
        424,870      L1-dcache-loads-misses  # 1.51% of all L1-dcache hits  ( +- 4.53% )  (78.09%)
     28,184,523      L1-dcache-loads                                        ( +- 0.84% )  (55.08%)
         19,739      L1-dcache-prefetch-misses                              ( +- 5.46% )  (38.01%)

    0.016823101 seconds time elapsed                                        ( +- 1.40% )
~~~
* Experiment platform spec, Intel Core i5-3230M ([source](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-3230M%20(PGA)%20Mobile%20processor.html)):
    * the cache line size is 64 bytes at every level
    * AVX is supported

| cache   | cache size |
| ------- | ---------- |
| Level 1 | 2 x 32 KB 8-way set associative instruction caches / 2 x 32 KB 8-way set associative data caches |
| Level 2 | 2 x 256 KB 8-way set associative caches |
| Level 3 | 3 MB 12-way set associative shared cache |

### Close reading of the [software pipelining](https://hackmd.io/s/S1fZ1m6pe) notes and the corresponding ==videos==
#### Cache architecture
* Per [perf-stat](https://zhengheng.me/2015/11/12/perf-stat/) and the notes above:
![](https://i.imgur.com/B9xRTPG.png)
* Basic structure — note that the level-3 cache is shared:
    1. level-1 data cache
    2. level-1 inst cache
    3. MMU: memory management unit
    4. TLB: translation lookaside buffer
    5. level-2 cache
    6. level-3 cache
~~~
Cache sizes on this machine:
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  3072K
~~~
* Fetching data:
![](https://i.imgur.com/hBkEQg6.png)
* Upper figure:
    1. The CPU uses the virtual address to try reading the data from the level-1 cache.
    2. On a miss, it asks the MMU for the data.
* Lower figure (the details between level 1 and level 2):
    3. The MMU looks up the virtual address in the TLB (hit → step 4, miss → step 5).
       TLB: a cache of virtual-address translations that speeds up virtual-to-physical address translation.
    4. The MMU translates the virtual address into a physical address.
       If the translation fails, a fault occurs.
       If it succeeds, the level-2 cache is read; if the data is not in level 2 either, the request goes on to the level-3 cache and then to main memory.
    5. The MMU fetches the translation from the translation tables (also called page tables) in main memory and stores it into the TLB.
       This step is implemented in hardware: the MMU reads physical memory directly.
    6. After step 5 fills the TLB, go back to step 4.
* Mapping this onto the paper's Cache Locality Hint discussion, it is easy to see why finding data in the level-1 cache matters so much: on a miss, steps 2 and 3 above are comparatively expensive. Hence the earlier points:
    * Hardware prefetching places the data in the lower-level cache of the requesting core.
        * Benefit: lower-level (L2 or L3) prefetch-block insertion greatly reduces higher-level (L1) cache pollution.
        * Drawback: the L2-to-L1 latency can degrade performance significantly.
    * Software-prefetched data is placed directly into the L1 cache.
        * Greater flexibility: data can be placed in the right cache level.

## Attempting further speedups with AVX
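As a starting point, a sketch of an AVX2 8x8 transpose kernel (my own draft: it assumes AVX2 support, `-mavx2`, and that `w` and `h` are multiples of 8; it still has to be verified and benchmarked against the SSE version):

~~~c
#include <immintrin.h>  /* AVX2 */

void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* load 8 rows of the 8x8 tile */
            __m256i r0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i r1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i r2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i r3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i r4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i r5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i r6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i r7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
            /* 32-bit interleave within 128-bit lanes (as in the SSE version) */
            __m256i t0 = _mm256_unpacklo_epi32(r0, r1);
            __m256i t1 = _mm256_unpackhi_epi32(r0, r1);
            __m256i t2 = _mm256_unpacklo_epi32(r2, r3);
            __m256i t3 = _mm256_unpackhi_epi32(r2, r3);
            __m256i t4 = _mm256_unpacklo_epi32(r4, r5);
            __m256i t5 = _mm256_unpackhi_epi32(r4, r5);
            __m256i t6 = _mm256_unpacklo_epi32(r6, r7);
            __m256i t7 = _mm256_unpackhi_epi32(r6, r7);
            /* 64-bit interleave within lanes */
            __m256i u0 = _mm256_unpacklo_epi64(t0, t2);
            __m256i u1 = _mm256_unpackhi_epi64(t0, t2);
            __m256i u2 = _mm256_unpacklo_epi64(t1, t3);
            __m256i u3 = _mm256_unpackhi_epi64(t1, t3);
            __m256i u4 = _mm256_unpacklo_epi64(t4, t6);
            __m256i u5 = _mm256_unpackhi_epi64(t4, t6);
            __m256i u6 = _mm256_unpacklo_epi64(t5, t7);
            __m256i u7 = _mm256_unpackhi_epi64(t5, t7);
            /* exchange 128-bit lanes to finish the transpose */
            r0 = _mm256_permute2x128_si256(u0, u4, 0x20);
            r1 = _mm256_permute2x128_si256(u1, u5, 0x20);
            r2 = _mm256_permute2x128_si256(u2, u6, 0x20);
            r3 = _mm256_permute2x128_si256(u3, u7, 0x20);
            r4 = _mm256_permute2x128_si256(u0, u4, 0x31);
            r5 = _mm256_permute2x128_si256(u1, u5, 0x31);
            r6 = _mm256_permute2x128_si256(u2, u6, 0x31);
            r7 = _mm256_permute2x128_si256(u3, u7, 0x31);
            /* store 8 transposed rows */
            _mm256_storeu_si256((__m256i *)(dst + (x + 0) * h + y), r0);
            _mm256_storeu_si256((__m256i *)(dst + (x + 1) * h + y), r1);
            _mm256_storeu_si256((__m256i *)(dst + (x + 2) * h + y), r2);
            _mm256_storeu_si256((__m256i *)(dst + (x + 3) * h + y), r3);
            _mm256_storeu_si256((__m256i *)(dst + (x + 4) * h + y), r4);
            _mm256_storeu_si256((__m256i *)(dst + (x + 5) * h + y), r5);
            _mm256_storeu_si256((__m256i *)(dst + (x + 6) * h + y), r6);
            _mm256_storeu_si256((__m256i *)(dst + (x + 7) * h + y), r7);
        }
    }
}
~~~

The lane-crossing `_mm256_permute2x128_si256` step is something the SSE version does not need; whether the twice-as-wide loads outweigh it is exactly what the benchmark should show.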
## Assignment Requirements
* Read the paper "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (read it before touching the code, or the code will only leave you lost!) and record your understanding and questions
    * [論文重點提示和解說](https://hackmd.io/s/HJtfT3icx)
* Read the [software pipelining](https://hackmd.io/s/S1fZ1m6pe) notes in detail, along with the corresponding ==videos==
* Build and run [prefetcher](https://github.com/sysprog21/prefetcher) on Linux/x86_64 (note: a 64-bit system is required; do not run it inside a virtual machine)
* Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and the prefetcher's impact on the cache
* Fork [prefetcher](https://github.com/sysprog21/prefetcher) on GitHub and, following the experiments in the selected notes, try to push performance further with AVX; for analysis methods see [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)
* Modify the `Makefile` to produce separate executables for `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose` (following the [phonebook](s/S1RVdgza) approach); apply the encapsulation techniques from [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl) to wrap the different transpose implementations in an object-oriented style, so that each method is accessed and benchmarked through a uniform interface
* Use perf to analyze cache miss/hit
    * learn the raw counter options of `perf stat`
* Referring to [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf), rewrite the code with SSE/AVX intrinsics
* Generalize the 8x8 matrix to larger sizes, such as 16x16, 32x32, 64x64, and verify the earlier reasoning with actual experiments
* In the [Homework2 assignment area](https://hackmd.io/s/HkVvxOD2-), describe the experiment design, your thoughts and questions from reading the paper, and your observations in detail

## Deadline
* On or before Oct 16, 2017
* The earlier there is activity on GitHub and the earlier the code gets reviewed, the higher the grade