# 2017q3 Homework2 (software-pipelining)
contributed by <`tina0405`>

## Paper Reading
Paper: [When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)

### 1. Introduction
* State of prefetching:
    * Although a variety of software and hardware prefetching mechanisms for tolerating cache miss latency exist, relatively simple software prefetching algorithms have appeared in state-of-the-art compilers like icc and gcc.
* Two open problems with inserting prefetch intrinsics:
    * There are few rigorous guidelines on how best to insert prefetch intrinsics.
    * The complexity of the interaction between software and hardware prefetching is not well understood.

---
Terminology:
* Cache prefetching, per [wiki](https://en.wikipedia.org/wiki/Cache_prefetching#Stream_buffers): fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed.
* Intrinsics — see this [SSE introduction](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_3.html):
~~~
Intrinsics look like ordinary functions, but the compiler translates them
directly into machine instructions. For example, a type declaration:

    __m128 data1, data2;   // 128-bit

The basic form of an SSE intrinsic is:

    _mm_<opcode>_<suffix>

<opcode> is the operation class, e.g. add, sub, etc.
<suffix> is the data type.

SSE (Streaming SIMD Extensions): a SIMD instruction set introduced by
Intel in 1999 together with the Pentium III processor.

SIMD (single instruction, multiple data): one instruction performs the
same operation on multiple data elements at once.
~~~
* See also [SSE (Streaming SIMD Extensions)](http://www.songho.ca/misc/sse/sse.html) for a first reading of what the paper means by "insert prefetch intrinsics" (a minimal C sketch appears at the end of this section):
~~~
Prefetch instructions provide cache hints to fetch data to the L1 and/or
L2 cache before the program actually needs the data:

prefetcht0:  move the data from memory to L1 and L2 caches using the t0 hint.
prefetcht1:  move the data from memory to L2 cache using the t1 hint.
prefetchnta: move non-temporal aligned data from memory to L1 cache directly (bypassing L2).
~~~
---
* Main research directions of the paper:
    * provide guidelines for inserting prefetch intrinsics in the presence of a hardware prefetcher;
    * automated cooperative software/hardware prefetching schemes.
![](https://i.imgur.com/RP7RgKU.png)
    * Baseline binaries have only compiler-inserted software prefetches (no hardware prefetching).
    * positive: software prefetching improves performance by more than 5%.
    * negative: software prefetching degrades performance by more than 5%.
* Summary of the software and hardware prefetch configurations:
    * GHB and STR are the hardware prefetchers; the study compares SW, HW, and SW+HW configurations.
    * STR (Stream Buffers), per [hardware prefetching](https://en.wikipedia.org/wiki/Cache_prefetching#Stream_buffers):
    ![](https://i.imgur.com/942zI96.png)
        * The mechanism: suppose there are 4 buffered blocks; when block A misses, the consecutive blocks A+1, A+2, A+3, A+4 are brought up. This is called sequential prefetching and is often used for instruction prefetch (spatial locality).
    * GHB (Global History Buffer), per [Data Cache Prefetching Using a Global History Buffer](http://www.eecg.toronto.edu/~steffan/carg/readings/ghb.pdf):
        * an alternative structure for holding prefetcher history
        * the address history is held in a FIFO table
        * the GHB keeps a table of all global miss addresses
        * history information is maintained in linked lists
        * lookups go through a hash table
    ![](https://i.imgur.com/aqcjZXw.png)
* Much prior work studies software and hardware prefetching separately, but little focuses on their interaction. The paper therefore asks:
    (1) What are the limitations and overheads of software prefetching?
    (2) What are the limitations and overheads of hardware prefetching?
    (3) When is it beneficial to use software and/or hardware prefetching?
* Positive impact — software prefetching targets:
    * short array streams
    * irregular memory address patterns
    * L1 cache miss reduction
* Negative effects:
    * software prefetching can interfere with the training of the hardware prefetcher.
* "Software prefetching" here means prefetch intrinsics used on top of gcc or icc compiler-inserted prefetching.
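To make "inserting prefetch intrinsics" concrete before diving into the paper, here is a minimal C sketch (my own toy example, not from the paper; `DIST` is a made-up distance):

~~~c
#include <xmmintrin.h>  /* _mm_prefetch, _MM_HINT_T0 */

#define DIST 16  /* prefetch distance in elements; an assumed value */

/* Sum an array while prefetching DIST elements ahead of the current load. */
float sum_with_prefetch(const float *a, int n)
{
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)
            _mm_prefetch((const char *)&a[i + DIST], _MM_HINT_T0);
        sum += a[i];
    }
    return sum;
}
~~~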
### 2. BACKGROUND ON SOFTWARE AND HARDWARE PREFETCHING
* Common data structures:
![](https://i.imgur.com/R2dJN2I.png)
    * A cache line is the smallest unit moved between memory and the cache.
    * The Hardware pref. column suggests possible hardware prefetching mechanisms suitable for each data structure.
    * The Software pref. column lists possible software prefetching mechanisms.
    * Hashes are much harder to prefetch effectively.
    * The paper refers to unit-stride cache-line accesses as *streams*, and to access stride distances greater than two cache lines as *strided*.

Terminology, per [wiki](https://en.wikipedia.org/wiki/Stride_of_an_array):
unit stride — an array whose stride is exactly the size of each of its elements is contiguous in memory.

* Software Prefetch Intrinsics
    * part of the x86 SSE SIMD extension
![](https://i.imgur.com/H2wnSSb.png)
Note: Temporal — data will be used again soon.
    * Instruction format: `_mm_prefetch(char *ptr, int hint)`
    * One intrinsic roughly expands to 2 instructions (a directly computed address) or 4 instructions (an indirectly computed memory address), per the paper's example:
~~~
LEA &a[i], R0      // calculate addr(a[i])
PREFETCH R0        // prefetch a[i]
(a) Direct Indexing Pattern (a[i])

LEA &b[i], R0      // calculate addr(b[i])
MOV R1, R0         // LD MEM[R1]: access b[i]
LEA &a[0]+R1, R2   // calculate addr(a[b[i]])
PREFETCH R2        // prefetch a[b[i]]
(b) Indirect Indexing Pattern (a[b[i]])

Direct:   the target (e.g. A[i]) is obtained directly.
Indirect: the target requires two or more dependent loads (e.g. B[A[i]]).
~~~
* Prefetches are classified into six categories:
![](https://i.imgur.com/UENcFhI.png)
* Illustrated on a timeline:
![](https://i.imgur.com/dkRCWRc.png)
* Software Prefetch Distance:
![](https://i.imgur.com/5PudKdd.png)
    * Prefetch requests must be sent early enough to fully hide memory latency.
    * This value D is called the prefetch distance: the distance ahead of which a prefetch should be requested on a memory address, D ≥ ⌈l / s⌉, where
        * l: prefetch latency
        * s: length of the shortest path through the loop body
    * Since l and s both vary at runtime, D is chosen large enough to hide the latency.
    * If D is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched at all — less coverage and more cache misses.
* Direct and Indirect Memory Indexing
    * Direct memory indexing:
        * easy to prefetch in hardware, since the memory addresses show regular stream/stride behavior.
    * Indirect memory indexing:
        * requires special hardware prefetching mechanisms, but is relatively simple to compute in software — the address calculation is easy there;
        * has a higher overhead than direct memory access: two loads are required, so two levels of prefetch streams are needed (the indirect assembly pattern is shown above; a C sketch follows).
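The same two-level idea in C, as a sketch (hypothetical arrays `a` and `b`; prefetching the index array 2×D ahead is my own heuristic, not the paper's): `b[i + D]` must itself be resident before `&a[b[i + D]]` can even be computed.

~~~c
#include <xmmintrin.h>

/* D is the prefetch distance; its choice is discussed above and in Sec. 4. */
void indirect_prefetch_sketch(float *out, const float *a, const int *b,
                              int n, int D)
{
    for (int i = 0; i < n; i++) {
        /* level 1: prefetch the index array twice as far ahead ... */
        if (i + 2 * D < n)
            _mm_prefetch((const char *)&b[i + 2 * D], _MM_HINT_T0);
        /* ... level 2: so b[i + D] is cached by the time it is used here */
        if (i + D < n)
            _mm_prefetch((const char *)&a[b[i + D]], _MM_HINT_T0);
        out[i] = a[b[i]];  /* the demand access */
    }
}
~~~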
### 3. POSITIVE AND NEGATIVE IMPACTS OF SOFTWARE PREFETCHING
* Where software prefetching beats hardware prefetching:
    * Large Number of Streams
        * Hardware is limited by resources such as stream detectors and book-keeping mechanisms.
        * Scientific computing workloads with multiple streams can exceed the hardware's capacity.
        * Example: lbm, a stencil-based computation on a 3D grid whose loop references all 27 nearest neighbors of each grid point
          ( b[i][j][k] = (a[i][j][k] + a[i][j][k+1] + a[i][j][k−1] + a[i][j−1][k] + a[i][j+1][k] + ...) / 27 ).
        * Software prefetching can insert prefetch requests independently for each stream in lbm.
    * Short Streams
        * Hardware prefetchers require training time to detect the direction and distance of a stream or stride.
        * Even aggressive hardware prefetchers need at least two cache misses to detect the direction of a stream.
        * If a stream is too short, there will not be enough cache misses to train a hardware prefetcher and load useful cache blocks.
        * Example: milc exhibits short-stream behavior. It operates on many 3x3 matrices with 16 bytes per element, so each matrix totals 144 bytes, or just three cache blocks — too short to train the hardware prefetchers.
    * Irregular Memory Access
        * Software only needs to insert appropriate prefetch intrinsics; doing the same in hardware is far more complex.
        * RDS (recursive data structure) prefetching suits mcf.
    * Cache Locality Hint
        * Hardware prefetching places the data in the lower-level cache of the requesting core.
            * Benefit: inserting prefetched blocks into the lower level (L2 or L3) greatly reduces higher-level (L1) cache pollution.
            * Drawback: the L2-to-L1 latency can degrade performance significantly.
        * Software-prefetched data is placed directly into the L1 cache.
            * Greater flexibility: data can be placed in the right cache level.
    * Loop Bounds
        * Loop bounds of array data structures are known in software, so software can avoid prefetching past them.
        * Techniques that prevent generating out-of-bounds prefetch requests: loop unrolling, software pipelining, and branch instructions.
---
Note: *aggressive* means a large prefetch distance and a high prefetch degree.
Overly aggressive prefetching results in early or incorrect prefetch requests, in addition to consuming memory bandwidth.
---
* Negative impacts of software prefetching:
    * Increased Instruction Count
        * Prefetch instructions consume fetch/execution bandwidth and machine resources.
        * In the figure, bwaves sees its instruction count grow by 100% because of software prefetching.
    ![](https://i.imgur.com/b3bXlYJ.png)
    * Static Insertion
        * Memory latency, effective cache size, and bandwidth cannot be adapted at runtime; these keep changing and cannot be anticipated, especially on heterogeneous architectures.
        * In hardware, prefetch distance and prefetching degree are usually fixed as well, but feedback-driven adaptive hardware prefetching mechanisms exist; adaptive software mechanisms are largely nonexistent.
    * Code Structure Change
        * When a loop body contains few instructions, inserting software prefetch instructions that issue in time is challenging: there are not enough computations between the prefetch instructions and the demand requests.
        * Remedy: code structure changes, such as loop splitting, are required (see the sketch after this list).
        * Consequences: more instructions, and software prefetching becomes harder.
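A sketch of one common form of loop splitting (my own example, not the paper's): splitting the loop removes the per-iteration bounds check that guarding the prefetch would otherwise require, so the main loop stays tight.

~~~c
#include <xmmintrin.h>

#define D 16  /* assumed prefetch distance, in elements */

void scale_split(float *a, int n, float k)
{
    int i = 0;
    /* main loop: i + D is always in bounds, so no per-iteration check */
    for (; i + D < n; i++) {
        _mm_prefetch((const char *)&a[i + D], _MM_HINT_T0);
        a[i] *= k;
    }
    /* split-off tail: the last D iterations, with no prefetching */
    for (; i < n; i++)
        a[i] *= k;
}
~~~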
* Benefits of combining software + hardware prefetching:
    * Handling Multiple Streams
        * Together they cover more data streams.
        * Example: a hardware prefetcher might cover regular streams while software prefetching covers irregular streams.
    * Positive Training
        * Software prefetch requests can actually help train the hardware prefetcher.
        * Example: if a block prefetched by software arrives late, a trained hardware prefetcher can improve prefetch timeliness.
* Drawbacks of combining software + hardware prefetching:
    * Negative Training
        * If blocks prefetched by software hide part of one or more streams, the hardware prefetcher will not be trained properly.
        * Software prefetch instructions can also trigger overly aggressive hardware prefetches, which results in early requests.
    * Harmful Software Prefetching
        * Hardware prefetchers generally have lower accuracy than software prefetchers.
        * When software prefetches are inaccurate, they increase stress on the cache and memory bandwidth, which usually reduces the effectiveness of hardware prefetching.

### 4. EXPERIMENTAL METHODOLOGY
* The goal of the experiments is to quantify these interactions:
    * find promising software prefetch candidates;
    * insert additional software prefetch intrinsics;
    * to make the choice of prefetch distance systematic, compute it with the formula below (a worked example with made-up profile numbers follows the prefetch-distance discussion in the next section).

##### Prefetch Intrinsic Insertion Algorithm
![](https://i.imgur.com/BRI7wnf.png)
* A prefetch candidate: L1 cache misses per 1000 instructions (MPKI) > 0.05
* Symbols:
    * K is a constant
    * L = average memory latency
    * IPCbench = profiled average IPC of each benchmark
    * Wloop = average instruction count of one loop iteration
* The paper uses K = 4 and L = 300.
* IPCbench and Wloop come from profiling data, so the distance choice is effectively profile-driven.

### 5. EVALUATIONS: BASIC OBSERVATIONS ABOUT PREFETCHING
#### Overhead and Limitations of Software Prefetching
* Instruction overhead: the main negative impact of software prefetching.
* Every LANG/TYPE of instruction costs some cycles, shown here as IPC:
![](https://i.imgur.com/0ZZutWy.png)
* This part ties back to the Section 3.2 discussion; the relevant figure again:
![](https://i.imgur.com/b3bXlYJ.png)
    * For GemsFDTD, bwaves, and leslie3d, the total number of instructions increases by more than 50% (more than 100% for bwaves).
    * All of the above add over 50% more instructions to get the prefetching effect, which ought to be a heavy burden.
    * Remarkably, GemsFDTD stays positive despite the >50% instruction increase: the benefit of its prefetches outweighs the instruction overhead.
![](https://i.imgur.com/GX2WbxF.png)
* To measure the overhead of software prefetching, the authors eliminate, one factor at a time: cache pollution (SW+P), bandwidth consumption (SW+B, between core and memory), memory access latency (SW+L), redundant prefetch overhead (SW+R), and instruction overhead (SW+I).
    * The question is which removed overhead lowers the cycle count the most, controlling one variable at a time.
    * Cache pollution would be caused by early or incorrect prefetches, but it has little impact in their experiments.
    * SW+B matters least: current machines provide enough bandwidth for single-thread applications.
    * SW+L: software prefetching does not completely hide memory latency.
    * SW+R: notably, even with a large number of redundant prefetches, the negative impact is usually negligible.

#### The Effect of Prefetch Distance
![](https://i.imgur.com/mowwixn.png)
* The x-axis of Figure 6 is the prefetch distance relative to the base prefetch distance, in units of a cache block.
* The base prefetch distance is calculated using Equation (1):
![](https://i.imgur.com/5PudKdd.png)
    * This value D is called the prefetch distance: the distance ahead of which a prefetch should be requested on a memory address.
    * l: prefetch latency
    * s: length of the shortest path through the loop body
* D must not be too small — it has to hide the latency — but too large is also bad: by the time the data is needed it may already have been evicted from the cache.
* Still, Table VII shows that most benchmarks are insensitive to the distance.
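As a sanity check of the Section 4 distance formula, with made-up profile numbers (the paper's K = 4 and L = 300 cycles, plus an assumed IPCbench of 1.5 and Wloop of 120 instructions):

$$
D = \left\lceil \frac{K \times L \times IPC_{bench}}{W_{loop}} \right\rceil
  = \left\lceil \frac{4 \times 300 \times 1.5}{120} \right\rceil
  = 15
$$

so prefetches would be issued 15 cache blocks ahead of the demand access.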
#### Cache-Level Insertion Policy
* An ordinary out-of-order processor can tolerate L1 misses, so prefetched blocks are usually inserted into a lower cache level to avoid polluting L1; but when the miss rate exceeds what the processor can tolerate, performance drops.
* The software advantage: prefetched blocks can be inserted directly, precisely, and safely into the L1 cache, without the pollution risk.
* The experiments use the T0 cache-level placement hint (see the hint sketch at the end of this section):
![](https://i.imgur.com/BcQJzP1.png)
    * In the experiments T0 is usually better than T1 or T2.
    * Note that T1 and T2 behave the same (the simulated machine has a two-level cache).
    * Why T0 beats T1 (T2): it hides L1 cache misses by inserting prefetched blocks into the L1 cache.
![](https://i.imgur.com/QQv3nwR.png)
    * This figure shows software prefetching reducing the L1 cache miss rate from 10.4% to 0.01%.

#### Effect of Using Hardware and Software Prefetching Together
* To observe the software/hardware interaction, two hardware-prefetcher training policies are distinguished:
    * NT (SW+GHB, SW+STR)
        * The hardware prefetcher's training algorithm ignores software prefetch requests;
        * the hardware prefetcher is trained only by demand cache misses.
    * Train (SW+GHB+T, SW+STR+T)
        * The hardware prefetcher's training algorithm includes software prefetch requests in addition to the usual demand misses.
![](https://i.imgur.com/5rq7nR7.png)
    * The sphinx3 benchmark shows a small benefit.
    * milc, gcc, and soplex exhibit negative effects.
    * A 3–5% gain comes from training the hardware prefetcher early (fewer late prefetches),
    * but the negative impact can be large: 55.9% for libquantum and 22.5% for milc.
    * Why milc suffers: short streams — software prefetching can trigger unnecessary hardware prefetches.
    * Why libquantum suffers:
        (1) L1 miss penalties: software prefetches usually insert into L1 and L2, but hardware prefetches trained by software typically insert into L2 only, not L1;
        (2) overly aggressive prefetches: software and hardware have independent prefetch distances, so when software trains hardware, the hardware prefetch requests can become too aggressive (blocks that should be fetched at the software distance get fetched at the hardware distance instead), causing many early prefetches.
    * Conclusion: avoid training the hardware prefetcher with software prefetch requests.

#### Prefetch Coverage
![](https://i.imgur.com/7GSHNjg.png)
* Figure 10(a) shows the prefetch coverage, i.e. the ratio of useful prefetches to total L2 misses.
* Figure 10(b) shows the prefetch coverage of SW+GHB and SW+STR.
* Where the coverage of software prefetching is smaller than that of hardware, performance usually falls into the neutral and negative groups.
* Conclusion: lower coverage is why the neutral and negative groups fail.
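For reference, the four hints map onto `_mm_prefetch` like this (a sketch only; the actual placement policy is microarchitecture-dependent, as the T1/T2 equivalence above shows):

~~~c
#include <xmmintrin.h>

/* One prefetch per cache-level hint, for a hypothetical address p. */
void prefetch_hints(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* temporal: all cache levels, incl. L1  */
    _mm_prefetch(p, _MM_HINT_T1);   /* into L2 and below, not L1             */
    _mm_prefetch(p, _MM_HINT_T2);   /* into L3 and below (same as T1 on a
                                       two-level cache)                      */
    _mm_prefetch(p, _MM_HINT_NTA);  /* non-temporal: minimize cache pollution */
}
~~~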
## Development Environment
~~~
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 58
Model name:            Intel(R) Core(TM) i5-3230M CPU @ 2.60GHz
Stepping:              9
CPU MHz:               1199.865
CPU max MHz:           3200.0000
CPU min MHz:           1200.0000
BogoMIPS:              5188.57
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3
~~~

## Code Analysis
Starting from [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx):
* Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and the prefetcher's impact on the cache.
* Measured results of the original experiment:
~~~
tina@tina-X550VB:~/Hw/prefetcher$ ./main
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15
sse prefetch: 60579 us
sse: 115357 us
naive: 245752 us
~~~
* First, `naive_transpose`: a plain transpose that moves one element at a time.
~~~c=
void naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            *(dst + x * h + y) = *(src + y * w + x);
}
~~~
---
* `sse_transpose` improves on `naive_transpose`:
~~~c=
void sse_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
~~~
* About `_mm_unpacklo_epi32`, per [Microsoft Specific](https://msdn.microsoft.com/zh-tw/library/x8atst9d(v=vs.100).aspx):
~~~
__m128i _mm_unpacklo_epi32 (__m128i a, __m128i b);

Interleaves the lower 2 signed or unsigned 32-bit integers in a with
the lower 2 signed or unsigned 32-bit integers in b.

Return value:
r0 := a0 ; r1 := b0
r2 := a1 ; r3 := b1
~~~
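A quick self-contained check of the unpack semantics (my own test program, assuming SSE2); the expected output follows the `r0..r3` pseudocode above:

~~~c
#include <stdio.h>
#include <emmintrin.h>  /* SSE2: _mm_unpack{lo,hi}_epi32 */

int main(void)
{
    __m128i a = _mm_setr_epi32(0, 1, 2, 3);  /* a0..a3 */
    __m128i b = _mm_setr_epi32(4, 5, 6, 7);  /* b0..b3 */
    int r[4];

    _mm_storeu_si128((__m128i *)r, _mm_unpacklo_epi32(a, b));
    printf("unpacklo: %d %d %d %d\n", r[0], r[1], r[2], r[3]);  /* 0 4 1 5 */

    _mm_storeu_si128((__m128i *)r, _mm_unpackhi_epi32(a, b));
    printf("unpackhi: %d %d %d %d\n", r[0], r[1], r[2], r[3]);  /* 2 6 3 7 */
    return 0;
}
~~~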
* The resulting transpose looks like this, per [4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html):
![](https://i.imgur.com/bO2IYYv.png)
* One instruction operating on several data elements at once is the original meaning of SIMD (Single Instruction, Multiple Data).
* Two guesses about why it is faster:
    (1) SSE operates on multiple data per instruction, so it takes less time;
    (2) the number of loop iterations shrinks — the unrolling described in the paper.
* Test: unroll the naive version without SSE (controlling for that one variable). Note the parentheses in `(y + k) * w + x`; without them, `y + k * w + x` would read the wrong elements.
~~~c=
void unrolling_naive_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y += 4) {
            *(dst + x * h + y + 0) = *(src + (y + 0) * w + x);
            *(dst + x * h + y + 1) = *(src + (y + 1) * w + x);
            *(dst + x * h + y + 2) = *(src + (y + 2) * w + x);
            *(dst + x * h + y + 3) = *(src + (y + 3) * w + x);
        }
}
~~~
* Finding: merely reducing the loop count already helps a lot. But unrolling trades space for time, so larger matrices still need testing — a test harness that averages over many runs is needed before drawing further conclusions.
~~~
sse prefetch:    62782 us
sse:            118891 us
naive:          254991 us
unrolling naive: 41136 us
~~~
> The SSE version that actually manipulates the data turns out slower here; the time it costs deserves discussion.
* Analyzing the four versions from the cache-miss angle:
    * The four variants were built into 4 separate executables via `#if defined()` to make measuring cache misses convenient.
    * Observation: even though `unrolling_naive` is the fastest of the four, its cache-miss *rate* is no better than naive's (94.5% vs. 93.9% of all cache refs); only its absolute miss count drops, since it issues far fewer cache references.
    * `sse_prefetch`, which reads data in ahead of time, lowers the cache-miss rate by only about 6 percentage points relative to naive, but compared with the other versions, prefetching is the better way to reduce cache misses.
~~~
Performance counter stats for './sse' (100 runs):

      6,623,840      cache-misses       # 88.219 % of all cache refs  ( +- 0.14% )
      7,508,365      cache-references                                 ( +- 0.03% )
  1,236,536,992      instructions       # 1.40 insn per cycle         ( +- 0.00% )
    885,797,861      cycles                                           ( +- 0.23% )

    0.290275488 seconds time elapsed                                  ( +- 0.27% )

Performance counter stats for './sse_prefetch' (100 runs):

      6,601,353      cache-misses       # 87.664 % of all cache refs  ( +- 0.05% )
      7,530,286      cache-references                                 ( +- 0.02% )
  1,282,611,347      instructions       # 1.77 insn per cycle         ( +- 0.00% )
    725,189,318      cycles                                           ( +- 0.07% )

    0.234799490 seconds time elapsed                                  ( +- 0.17% )

Performance counter stats for './naive' (100 runs):

     18,857,380      cache-misses       # 93.858 % of all cache refs  ( +- 0.02% )
     20,091,354      cache-references                                 ( +- 0.01% )
  1,448,528,038      instructions       # 1.14 insn per cycle         ( +- 0.00% )
  1,273,844,641      cycles                                           ( +- 0.18% )

    0.417337455 seconds time elapsed                                  ( +- 0.10% )

Performance counter stats for './unrolling_naive' (100 runs):

      2,172,771      cache-misses       # 94.526 % of all cache refs  ( +- 0.03% )
      2,298,595      cache-references                                 ( +- 0.07% )
  1,448,269,218      instructions       # 2.15 insn per cycle         ( +- 0.00% )
    672,497,285      cycles                                           ( +- 0.14% )

    0.217644893 seconds time elapsed                                  ( +- 0.19% )
~~~
---
* `sse_prefetch_transpose`: prefetch before operating on the data; the fastest of the original three versions.
~~~c=
void sse_prefetch_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
#define PFDIST 8
            _mm_prefetch(src + (y + PFDIST + 0) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 1) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 2) * w + x, _MM_HINT_T1);
            _mm_prefetch(src + (y + PFDIST + 3) * w + x, _MM_HINT_T1);
            __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            I0 = _mm_unpacklo_epi64(T0, T1);
            I1 = _mm_unpackhi_epi64(T0, T1);
            I2 = _mm_unpacklo_epi64(T2, T3);
            I3 = _mm_unpackhi_epi64(T2, T3);
            _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
            _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
            _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
            _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
        }
    }
}
~~~
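The `PFDIST 8` above is hard-coded; per the paper's prefetch-distance discussion it is worth sweeping. A small sketch (my own build-time knobs, not part of the original repo) that makes the distance and the hint tunable from the compiler command line, e.g. `gcc -DPFDIST=16 -DPF_HINT=_MM_HINT_T0 ...`:

~~~c
#include <xmmintrin.h>

#ifndef PFDIST
#define PFDIST 8             /* default: the value used above */
#endif
#ifndef PF_HINT
#define PF_HINT _MM_HINT_T1  /* default: the hint used above */
#endif

/* Prefetch the next four rows of the current 4x4 tile. */
static inline void prefetch_tile(const int *src, int x, int y, int w)
{
    for (int r = 0; r < 4; r++)
        _mm_prefetch((const char *)(src + (y + PFDIST + r) * w + x), PF_HINT);
}
~~~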
* Comparing all four versions side by side: the x-axis is the run index (100 runs) and the y-axis is time.
![](https://i.imgur.com/CQBdtXm.png)

### Measuring T0, T1, T2, NTA from the Cache-Level Insertion Policy
![](https://i.imgur.com/H2wnSSb.png)
* Per [Software Prefetching](https://lwn.net/Articles/255364/):
    * `_MM_HINT_T0` fetches data into all cache levels; if the data item is in a higher-level cache, it is loaded into L1d.
    * `_MM_HINT_T1` pulls the data into L2 and not into L1d.
    * If there is an L3 cache, `_MM_HINT_T2` can do something similar for it.
    * `_MM_HINT_NTA`
        * stands for non-temporal aligned;
        * avoids polluting the cache as much as possible, for data that will be used soon but only once;
        * when the data is evicted from L1d, it does not go into L2/L3 but is written straight back to memory.
    * The reference's advice matches the paper's results: in general, `_MM_HINT_T0` is fine for small amounts of data used right away, provided L1d is large enough to hold all the prefetched data; if the immediately-used working set is too large, falling back on T1 and T2 is better.
![](https://i.imgur.com/y3vWZrW.png)
* See also [all-about-cpu-cache](http://cenalulu.github.io/linux/all-about-cpu-cache/).
* With 4096 entries, T0 gives the best time, but the jitter in the measurements, and how to combine T1 and T2, still need further study.
* The jitter indicates an uneven distribution across runs; here the least uniform hint is NTA and the most uniform is T1.
![](https://i.imgur.com/oErDM0Q.png)
* 2000 entries — 884,152 cache misses:
~~~
Performance counter stats for './sse_prefetch' (100 runs):

        884,152      cache-misses
      2,470,661      L1-dcache-loads-misses  # 2.23% of all L1-dcache hits  ( +- 0.59% )  (79.42%)
    111,017,071      L1-dcache-loads                                        ( +- 0.24% )  (72.31%)
        744,840      L1-dcache-prefetch-misses                              ( +- 1.88% )  (40.05%)

    0.058283823 seconds time elapsed                                        ( +- 0.46% )
~~~
* 1000 entries — 355,783 cache misses:
~~~
Performance counter stats for './sse_prefetch' (100 runs):

        355,783      cache-misses
        424,870      L1-dcache-loads-misses  # 1.51% of all L1-dcache hits  ( +- 4.53% )  (78.09%)
     28,184,523      L1-dcache-loads                                        ( +- 0.84% )  (55.08%)
         19,739      L1-dcache-prefetch-misses                              ( +- 5.46% )  (38.01%)

    0.016823101 seconds time elapsed                                        ( +- 1.40% )
~~~
* Experiment platform spec, Intel Core i5-3230M ([source](http://www.cpu-world.com/CPUs/Core_i5/Intel-Core%20i5-3230M%20(PGA)%20Mobile%20processor.html)):
    * the cache line size is 64 bytes at every level
    * AVX is supported

| cache   | cache size |
| ------- | ---------- |
| Level 1 | 2 x 32 KB 8-way set associative instruction caches / 2 x 32 KB 8-way set associative data caches |
| Level 2 | 2 x 256 KB 8-way set associative caches |
| Level 3 | 3 MB 12-way set associative shared cache |

### Close reading of the [software pipelining](https://hackmd.io/s/S1fZ1m6pe) notes and the corresponding ==videos==
#### Cache architecture
* Per [perf-stat](https://zhengheng.me/2015/11/12/perf-stat/) and the notes above:
![](https://i.imgur.com/B9xRTPG.png)
* Basic structure — note that the level-3 cache is shared:
    1. level-1 data cache
    2. level-1 inst cache
    3. MMU: memory management unit
    4. TLB: translation lookaside buffer
    5. level-2 cache
    6. level-3 cache
~~~
Cache sizes on this machine:
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  3072K
~~~
* Fetching data:
![](https://i.imgur.com/hBkEQg6.png)
* Upper figure:
    1. The CPU uses the virtual address to try reading the data from the level-1 cache.
    2. On a miss, it asks the MMU for the data.
* Lower figure (the details between level 1 and level 2):
    3. The MMU looks up the virtual address in the TLB (hit → step 4, miss → step 5).
       TLB: a cache of virtual-address translations that speeds up virtual-to-physical address translation.
    4. The MMU translates the virtual address into a physical address.
       If the translation fails, a fault occurs.
       If it succeeds, the level-2 cache is read; if the data is not in level 2 either, the request goes on to the level-3 cache and then to main memory.
    5. The MMU fetches the translation from the translation tables (also called page tables) in main memory and stores it into the TLB.
       This step is implemented in hardware: the MMU reads physical memory directly.
    6. After step 5 fills the TLB, go back to step 4.
* Mapping this onto the paper's Cache Locality Hint discussion, it is easy to see why finding data in the level-1 cache matters so much: on a miss, steps 2 and 3 above are comparatively expensive. Hence the earlier points:
    * Hardware prefetching places the data in the lower-level cache of the requesting core.
        * Benefit: lower-level (L2 or L3) prefetch-block insertion greatly reduces higher-level (L1) cache pollution.
        * Drawback: the L2-to-L1 latency can degrade performance significantly.
    * Software-prefetched data is placed directly into the L1 cache.
        * Greater flexibility: data can be placed in the right cache level.

## Attempting further speedups with AVX
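As a starting point, a sketch of an AVX2 8x8 transpose kernel (my own draft: it assumes AVX2 support, `-mavx2`, and that `w` and `h` are multiples of 8; it still has to be verified and benchmarked against the SSE version):

~~~c
#include <immintrin.h>  /* AVX2 */

void avx_transpose(int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 8) {
        for (int y = 0; y < h; y += 8) {
            /* load 8 rows of the 8x8 tile */
            __m256i r0 = _mm256_loadu_si256((__m256i *)(src + (y + 0) * w + x));
            __m256i r1 = _mm256_loadu_si256((__m256i *)(src + (y + 1) * w + x));
            __m256i r2 = _mm256_loadu_si256((__m256i *)(src + (y + 2) * w + x));
            __m256i r3 = _mm256_loadu_si256((__m256i *)(src + (y + 3) * w + x));
            __m256i r4 = _mm256_loadu_si256((__m256i *)(src + (y + 4) * w + x));
            __m256i r5 = _mm256_loadu_si256((__m256i *)(src + (y + 5) * w + x));
            __m256i r6 = _mm256_loadu_si256((__m256i *)(src + (y + 6) * w + x));
            __m256i r7 = _mm256_loadu_si256((__m256i *)(src + (y + 7) * w + x));
            /* 32-bit interleave within 128-bit lanes (as in the SSE version) */
            __m256i t0 = _mm256_unpacklo_epi32(r0, r1);
            __m256i t1 = _mm256_unpackhi_epi32(r0, r1);
            __m256i t2 = _mm256_unpacklo_epi32(r2, r3);
            __m256i t3 = _mm256_unpackhi_epi32(r2, r3);
            __m256i t4 = _mm256_unpacklo_epi32(r4, r5);
            __m256i t5 = _mm256_unpackhi_epi32(r4, r5);
            __m256i t6 = _mm256_unpacklo_epi32(r6, r7);
            __m256i t7 = _mm256_unpackhi_epi32(r6, r7);
            /* 64-bit interleave within lanes */
            __m256i u0 = _mm256_unpacklo_epi64(t0, t2);
            __m256i u1 = _mm256_unpackhi_epi64(t0, t2);
            __m256i u2 = _mm256_unpacklo_epi64(t1, t3);
            __m256i u3 = _mm256_unpackhi_epi64(t1, t3);
            __m256i u4 = _mm256_unpacklo_epi64(t4, t6);
            __m256i u5 = _mm256_unpackhi_epi64(t4, t6);
            __m256i u6 = _mm256_unpacklo_epi64(t5, t7);
            __m256i u7 = _mm256_unpackhi_epi64(t5, t7);
            /* exchange 128-bit lanes to finish the transpose */
            r0 = _mm256_permute2x128_si256(u0, u4, 0x20);
            r1 = _mm256_permute2x128_si256(u1, u5, 0x20);
            r2 = _mm256_permute2x128_si256(u2, u6, 0x20);
            r3 = _mm256_permute2x128_si256(u3, u7, 0x20);
            r4 = _mm256_permute2x128_si256(u0, u4, 0x31);
            r5 = _mm256_permute2x128_si256(u1, u5, 0x31);
            r6 = _mm256_permute2x128_si256(u2, u6, 0x31);
            r7 = _mm256_permute2x128_si256(u3, u7, 0x31);
            /* store 8 transposed rows */
            _mm256_storeu_si256((__m256i *)(dst + (x + 0) * h + y), r0);
            _mm256_storeu_si256((__m256i *)(dst + (x + 1) * h + y), r1);
            _mm256_storeu_si256((__m256i *)(dst + (x + 2) * h + y), r2);
            _mm256_storeu_si256((__m256i *)(dst + (x + 3) * h + y), r3);
            _mm256_storeu_si256((__m256i *)(dst + (x + 4) * h + y), r4);
            _mm256_storeu_si256((__m256i *)(dst + (x + 5) * h + y), r5);
            _mm256_storeu_si256((__m256i *)(dst + (x + 6) * h + y), r6);
            _mm256_storeu_si256((__m256i *)(dst + (x + 7) * h + y), r7);
        }
    }
}
~~~

The lane-crossing `_mm256_permute2x128_si256` step is something the SSE version does not need; whether the twice-as-wide loads outweigh it is exactly what the benchmark should show.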
## Assignment Requirements
* Read the paper "[When Prefetching Works, When It Doesn’t, and Why](http://www.cc.gatech.edu/~hyesoon/lee_taco12.pdf)" (read it before touching the code, or the code will only leave you lost!) and record your understanding and questions
    * [論文重點提示和解說](https://hackmd.io/s/HJtfT3icx)
* Read the [software pipelining](https://hackmd.io/s/S1fZ1m6pe) notes in detail, along with the corresponding ==videos==
* Build and run [prefetcher](https://github.com/sysprog21/prefetcher) on Linux/x86_64 (note: a 64-bit system is required; do not run it inside a virtual machine)
* Explain the performance differences among `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose`, and the prefetcher's impact on the cache
* Fork [prefetcher](https://github.com/sysprog21/prefetcher) on GitHub and, following the experiments in the selected notes, try to push performance further with AVX; for analysis methods see [Matrix Multiplication using SIMD](https://hackmd.io/s/Hk-llHEyx)
* Modify the `Makefile` to produce separate executables for `naive_transpose`, `sse_transpose`, and `sse_prefetch_transpose` (following the [phonebook](s/S1RVdgza) approach); apply the encapsulation techniques from [你所不知道的 C 語言:物件導向程式設計篇](https://hackmd.io/s/HJLyQaQMl) to wrap the different transpose implementations in an object-oriented style, so that each method is accessed and benchmarked through a uniform interface
* Use perf to analyze cache miss/hit
    * learn the raw counter options of `perf stat`
* Referring to [Performance of SSE and AVX Instruction Sets](http://arxiv.org/pdf/1211.0820.pdf), rewrite the code with SSE/AVX intrinsics
* Generalize the 8x8 matrix to larger sizes, such as 16x16, 32x32, 64x64, and verify the earlier reasoning with actual experiments
* In the [Homework2 assignment area](https://hackmd.io/s/HkVvxOD2-), describe the experiment design, your thoughts and questions from reading the paper, and your observations in detail

## Deadline
* On or before Oct 16, 2017
* The earlier there is activity on GitHub and the earlier the code gets reviewed, the higher the grade