# 2017q1 Homework3 (software-pipelining)
contributed by <`chenweiii`>
## Hardware Environment
* Operating System: Ubuntu 16.04.2 LTS (64-bit)
* CPU: [Intel i7-6700 @ 3.4 GHz](http://www.cpu-world.com/CPUs/Core_i7/Intel-Core%20i7-6700.html)
* Cache:
    * L1d cache: 32 KB, 8-way set associative
    * L1i cache: 32 KB, 8-way set associative
    * L2 cache: 256 KB, 4-way set associative
    * L3 cache: 8 MB, 16-way set associative
    * Cache alignment: 64 B
* Memory: 32 GB
## Paper Reading
> Parts of this section are excerpted from [notes and commentary on the paper "When Prefetching Works, When It Doesn't, and Why"](https://hackmd.io/s/HJtfT3icx#)
### Summary
![](https://i.imgur.com/pwHxvmJ.png)
* Note how Speedup is computed: for each benchmark, the best-performing `SW + HW prefetcher` configuration is compared against the `HW prefetcher` alone. The benchmarks are therefore ordered left to right by how well each performs when software and hardware prefetching coexist.
* Observe that, except for gcc, bzip2, and soplex, adding intrinsics improves performance substantially.
> The text says the experiments use three binaries and three simulated hardware prefetching schemes, but the chart only shows GHB & STR, not a third scheme; likewise only the baseline and SW binaries appear, so I don't know what the third binary is.
* Intel's optimization guidelines point out that SW prefetching has a positive effect for short arrays, continuous but irregular memory accesses, and reducing L1 cache misses, which is consistent with the data analyzed in this paper.
> But I don't recall the paper specifically analyzing these three cases; maybe I missed it.
* The goal of this paper is to discuss the interaction between SW & HW prefetchers; it also observes that in some benchmarks the SW prefetcher over-trains the HW prefetcher, causing a large negative impact.
<br>
### Background on SW & HW prefetching
![](https://i.imgur.com/yN0hMjp.png)
* Different data structures suit different prefetching mechanisms; for example, Recursive Data Structures (RDS) are better handled by a SW prefetcher than by a HW prefetcher.
* Cache-line access classification:
    * stream: unit-stride cache-line accesses
    * strided: access stride distances greater than two cache lines
* Stream prefetchers, GHB prefetchers, and content-based prefetchers are all hardware-based prefetching mechanisms, but this experiment adopts only the first two, which have been successfully commercialized.
> Not sure whether my understanding here is correct. Still reading up on GHB.
#### Prefetch Classification
![](https://i.imgur.com/JCu3psd.png)
* A prefetch is classified according to where the demand access time falls on this axis.
<br>
![](https://i.imgur.com/dFYj7Eo.png)
* Redundant_dc: the cache block to be prefetched is already in the cache.
> How can the number of prefetches in each category be observed?
* Miss status holding registers (MSHR)
* 參考 [cornell memory system 的講義](https://jontse.com/courses/files/cornell/ece5730/Lecture04.pdf)
* When a miss occurs, the MSHRs are looked up to determine if the cache block is already being fetched.
* Each MSHR holds information for a primary miss.
* If an MSHR hits, then a secondary miss has occurred (a primary miss to the same line is outstanding).
> Still not clear what role the MSHR plays in a miss-under-miss cache design, or how the process works. Stuck... but the paper does not discuss MSHRs any further, so I'll look for more articles to read later.
#### Software Prefetch Distance
* Prefetching is useful only if prefetch requests are sent early enough to fully hide memory latency.
* If the prefetch distance is too large, prefetched data can evict useful cache blocks, and the elements at the beginning of the array may not be prefetched, leading to less coverage and more cache misses.
* Prefetch coverage
    * The ratio of useful prefetches to total L2 misses, i.e. the fraction of cache misses avoided by prefetching:
    * 100 × (Prefetch Hits / (Prefetch Hits + Cache Misses))
> How can the prefetch-hit counts be observed?
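As a sanity check on the definition above, the coverage formula can be written out directly. This is a trivial helper of my own for illustration; the numbers in the comment are made up.

```c
/* Prefetch coverage: the percentage of would-be misses that prefetching
 * turned into hits, per the formula above.
 * e.g. 8,000 useful prefetches against 2,000 remaining misses -> 80%. */
double coverage(double prefetch_hits, double cache_misses)
{
    return 100.0 * prefetch_hits / (prefetch_hits + cache_misses);
}
```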
#### Direct and Indirect Memory Indexing
* **Direct memory indexing** can be easily prefetched by hardware since the memory addresses show regular stream/stride behavior.
* **Indirect indexing** is relatively simpler to compute in software, but it usually has a higher overhead than direct memory indexing.
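To make the contrast concrete, here is a minimal sketch of my own (not from the paper): a stride-1 sum whose address pattern a hardware prefetcher detects on its own, versus an indirect sum `b[idx[i]]` where only software can compute the upcoming address. The distance `DIST` is an assumed value.

```c
#include <xmmintrin.h>

#define DIST 16  /* prefetch distance in elements (assumed; tune per machine) */

/* Direct indexing: a[i] has a fixed stride, easy for a HW prefetcher. */
long sum_direct(const int *a, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++)
        s += a[i];                 /* stride-1: HW prefetcher handles this */
    return s;
}

/* Indirect indexing: the future address depends on idx[], so software
 * computes it and issues the prefetch itself. */
long sum_indirect(const int *b, const int *idx, int n)
{
    long s = 0;
    for (int i = 0; i < n; i++) {
        if (i + DIST < n)          /* SW prefetch of the future address */
            _mm_prefetch((const char *) &b[idx[i + DIST]], _MM_HINT_T0);
        s += b[idx[i]];
    }
    return s;
}
```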
### Pros and cons of SW prefetching, and the effect of HW + SW
> I'm curious whether this summary comes from other literature or is drawn from this paper's own experimental results.
#### Benefits of SW prefetching over HW
* Large Number of Streams (Limited Hardware Resources)
* Short Streams
* Irregular Memory Access
* Cache Locality Hint
* Most hardware prefetchers place prefetched data in a lower-level cache (L2 or L3), but a software prefetcher can use a locality hint to place data directly into the L1 cache.
* Lower-level prefetch block insertion greatly reduces higher-level cache pollution, but the L2-to-L1 latency can degrade performance significantly.
* So the SW prefetching mechanism has greater flexibility in placing data at the right cache level.
* Loop Bounds
* Several methods, such as loop unrolling, software pipelining, and using branch instructions, can **prevent generating prefetch requests out of array bounds** in software.
* However, overly aggressive prefetching results in early or incorrect prefetch requests, in addition to **consuming memory bandwidth**.
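The loop-splitting idea above can be sketched as follows (my own example; `PFDIST` is an assumed distance): the main loop issues only in-bounds prefetches, and a short epilogue finishes the array without any, so no per-iteration bounds branch is needed around the prefetch.

```c
#include <xmmintrin.h>

#define PFDIST 8   /* prefetch distance, assumed for illustration */

/* Split the loop so prefetch requests never go past the array bounds. */
long sum_with_bounded_prefetch(const int *a, int n)
{
    long s = 0;
    int i = 0;
    for (; i + PFDIST < n; i++) {   /* main body: prefetches stay in bounds */
        _mm_prefetch((const char *) &a[i + PFDIST], _MM_HINT_T0);
        s += a[i];
    }
    for (; i < n; i++)              /* epilogue: no prefetching */
        s += a[i];
    return s;
}
```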
#### Negative Impacts of Software Prefetching
* Increased Instruction Count
* Static Insertion
* The decision to prefetch and the choice of parameters, such as **the data to be prefetched and the corresponding prefetch distance**, are made statically and therefore cannot adapt to runtime behavioral changes such as **varying memory latency, effective cache size, and bandwidth**, especially in heterogeneous architectures.
* Code Structure Change
#### Synergistic Effects when using HW + SW prefetching
* Handling Multiple Streams
* The HW prefetcher covers regular streams.
* The SW prefetcher covers irregular streams.
* Positive Training
* e.g. If a block prefetched by a software prefetcher is late, then a trained hardware prefetcher can improve prefetch timeliness.
#### Antagonistic Effects
* Negative Training
* Software prefetch requests can slow down the hardware prefetcher training.
* e.g. If prefetched blocks by software prefetching hide a part of one or more streams, the hardware prefetcher will not be trained properly.
* [Note] I don't quite understand what this example is trying to express.
* Software prefetch instructions can trigger overly aggressive hardware prefetches, which results in early requests.
* Harmful Software Prefetching
* When software prefetch requests are incorrect or early, the increased stress on cache and memory bandwidth can further reduce the effectiveness of hardware prefetching.
<br>
## Learning SSE
Reference: [Intel Intrinsics Guide](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#)
> Start by learning the instructions used in the program.
> I searched all over and couldn't find a good SSE tutorial; it turns out the Intrinsics Guide already provides the clearest explanation.
### First, reading the SSE introduction written on Hotball 的小屋
* Naming convention for SSE intrinsics:
    * `_mm_<opcode>_<suffix>`
    * `<opcode>` is the instruction class, and `<suffix>` is the data type.
* SSE floating-point instructions use only two data types: ps and ss.
    * ps means Packed Single-precision: the instruction operates on all four single-precision floats in the register.
    * ss means Scalar Single-precision: the instruction operates only on DATA0 of the register.
> Still not clear what the `<suffix>` of the three instructions below means, e.g. si128, epi32.
* Before using SSE intrinsics, remember to include the xmmintrin.h header first.
* Most SSE instructions that access memory require the address to be a multiple of 16 (i.e., aligned on a 16-byte boundary).
* If the data is not 16-byte aligned, the _mm_load_ps intrinsic cannot be used to load it; use _mm_loadu_ps instead, which is designed for data not aligned on a 16-byte boundary, but it is slower.
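A small sketch of the distinction (the helper name is my own): check the alignment at runtime and pick the matching intrinsic.

```c
#include <stdint.h>
#include <xmmintrin.h>

/* _mm_load_ps requires a 16-byte-aligned address; _mm_loadu_ps does not
 * but may be slower.  Choose based on the address's low four bits. */
__m128 load4(const float *p)
{
    if (((uintptr_t) p & 15) == 0)
        return _mm_load_ps(p);    /* aligned: potentially faster */
    return _mm_loadu_ps(p);       /* unaligned: always safe */
}
```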
> ~~Found a possible direction for improvement, but how do I make the matrix start at an address that is a multiple of 16?~~
* Depending on the result of a computation, the appropriate flags (divide-by-zero, invalid, etc.) are set, or an exception is raised. This is configured through the MXCSR register, a 32-bit flag register that controls whether each kind of exception is raised and records what happened during the last computation.
* Besides arithmetic instructions, SSE also supports some cache-control instructions: prefetch and movntps. The prefetch instruction actually comes in four variants, prefetcht0, prefetcht1, prefetcht2, and prefetchnta, all expressed through the same intrinsic, _mm_prefetch.
* The main purpose of the prefetch instruction is to have the CPU load data needed by later computation ahead of time: typically, before operating on the current data, tell the CPU to load the next batch. **The current computation and the load of the next batch can then proceed in parallel.** If the computation is complex enough, this can completely hide main-memory read latency. The different prefetch instructions tell the CPU to load the data into different cache levels. The most commonly used is prefetchnta, which loads the data into the cache closest to the CPU (usually L1 or L2 cache) and suits data that will be used within a short time.
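The mapping from the single intrinsic to the four instruction variants looks like this. These are only hints; they never change program results, so the calls below are functionally no-ops (the level comments are my rough understanding).

```c
#include <xmmintrin.h>

/* The four prefetch instructions share one intrinsic, _mm_prefetch;
 * the hint constant selects which variant is emitted. */
void prefetch_examples(const char *p)
{
    _mm_prefetch(p, _MM_HINT_T0);   /* prefetcht0: into all cache levels */
    _mm_prefetch(p, _MM_HINT_T1);   /* prefetcht1: into L2 and beyond */
    _mm_prefetch(p, _MM_HINT_T2);   /* prefetcht2: into L3 */
    _mm_prefetch(p, _MM_HINT_NTA);  /* prefetchnta: non-temporal, minimizes pollution */
}
```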
### SSE intrinsics used in [impl.c]
#### void _mm_prefetch (char const* p, int i)
* Description
* Fetch **the line of data** from memory that contains address p to a location in the cache hierarchy specified by the locality hint i.
#### __m128i _mm_loadu_si128 (__m128i const* mem_addr)
* Description
* Load 128-bits of integer data from memory into dst. mem_addr does not need to be aligned on any particular boundary.
* Operation
* ```dst[127:0] := MEM[mem_addr+127:mem_addr]```
#### __m128i _mm_unpacklo_epi32 (__m128i a, __m128i b)
* Description
* Unpack and interleave 32-bit integers from the low half of a and b, and store the results in dst.
* Operation
* Split the 128 bits in half: take the low 64 bits of both a and b and interleave them into the new variable.
* Replacing lo with hi does the corresponding thing with the high halves.
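To convince myself of the semantics, a tiny demo of my own that interleaves two vectors with both unpack variants: with a = {a0,a1,a2,a3} and b = {b0,b1,b2,b3}, unpacklo yields {a0,b0,a1,b1} and unpackhi yields {a2,b2,a3,b3}.

```c
#include <emmintrin.h>

/* Load a and b, interleave their low and high halves, store the results. */
void unpack_demo(const int *a, const int *b, int *lo, int *hi)
{
    __m128i va = _mm_loadu_si128((const __m128i *) a);
    __m128i vb = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) lo, _mm_unpacklo_epi32(va, vb));
    _mm_storeu_si128((__m128i *) hi, _mm_unpackhi_epi32(va, vb));
}
```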
#### void _mm_storeu_si128 (__m128i* mem_addr, __m128i a)
* Description
* Store 128-bits of integer data from a into memory. mem_addr does not need to be aligned on any particular boundary.
> After working through the teacher's transpose program with pen and paper, it's amazing!
<br>
## raw counter
Looked up the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html); my processor is a 6th-generation (Skylake) part.
> Searching the raw counters here was frustrating; I don't know which one can help my analysis. I tried to find the same two events as <`yenWu`>, but couldn't find them.
<br>
## Initial Observations
### Execution time of the three versions
> 100 samples per version
```
naive: 233908 us
sse: 154229 us
sse prefetch: 42235 us
```
### Analyzing cache misses with perf
```$perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./...```
```$sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"```
#### naive impl.
```
Performance counter stats for './naive' (100 runs):
41,213,342 cache-misses # 85.814 % of all cache refs ( +- 0.29% )
48,026,565 cache-references ( +- 0.25% )
1,448,367,681 instructions # 1.55 insn per cycle ( +- 0.00% )
936,303,948 cycles ( +- 0.30% )
0.320454039 seconds time elapsed ( +- 0.45% )
```
#### sse impl.
```
Performance counter stats for './sse' (100 runs):
14,535,369 cache-misses # 75.659 % of all cache refs ( +- 0.07% )
19,211,776 cache-references ( +- 0.05% )
1,236,449,412 instructions # 1.79 insn per cycle ( +- 0.00% )
690,682,519 cycles ( +- 0.15% )
0.257315101 seconds time elapsed ( +- 0.24% )
```
#### sse_prefetch impl.
```
Performance counter stats for './sse_prefetch' (100 runs):
8,599,805 cache-misses # 65.805 % of all cache refs ( +- 0.29% )
13,068,657 cache-references ( +- 0.18% )
1,282,489,200 instructions # 2.40 insn per cycle ( +- 0.00% )
534,001,192 cycles ( +- 0.14% )
0.140952357 seconds time elapsed ( +- 0.40% )
```
<br>
## Improving the SSE version
> Improve the alignment situation for SSE loads. I later found that <`yenWu`> also wrote a section on this problem, but apparently without results so far.
### memalign - allocate aligned memory
#### int posix_memalign(void **memptr, size_t alignment, size_t size);
* Description
* The function posix_memalign() allocates size bytes and places the address of the allocated memory in *memptr. The address of the allocated memory will be a multiple of alignment, which must be a power of two and a multiple of sizeof(void *). If size is 0, then posix_memalign() returns either NULL, or a unique pointer value that can later be successfully passed to free(3).
#### void *memalign(size_t alignment, size_t size);
* Description
* The obsolete function memalign() allocates size bytes and returns a pointer to the allocated memory. The memory address will be a multiple of alignment, which must be a power of two.
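A minimal wrapper around posix_memalign (the helper name is my own) matching the 16-byte requirement of _mm_load_si128 / _mm_store_si128:

```c
#include <stdlib.h>

/* Allocate a 16-byte-aligned buffer of `count` ints, suitable for the
 * aligned SSE load/store intrinsics; returns NULL on failure. */
int *alloc_aligned_ints(size_t count)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, count * sizeof(int)) != 0)
        return NULL;
    return (int *) p;
}
```

The buffer is released with the ordinary free(3), unlike some legacy aligned allocators.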
### Modified code
```clike=
/* main.c */
int *src = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);
int *out = (int *) memalign(16, sizeof(int) * TEST_W * TEST_H);
/* impl.c */
__m128i I0 = _mm_load_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_load_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_load_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_load_si128((__m128i *)(src + (y + 3) * w + x));
_mm_store_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_store_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_store_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_store_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
```
### Results
```
sse: 155314 us
sse prefetch: 44440 us
```
No significant improvement.
```
Performance counter stats for './sse' (100 runs):
15,797,948 cache-misses # 82.011 % of all cache refs ( +- 0.33% )
19,263,250 cache-references ( +- 0.07% )
1,236,505,999 instructions # 1.52 insn per cycle ( +- 0.00% )
813,709,554 cycles ( +- 0.66% )
0.273664634 seconds time elapsed ( +- 0.40% )
Performance counter stats for './sse_prefetch' (100 runs):
10,186,595 cache-misses # 72.837 % of all cache refs ( +- 0.59% )
13,985,518 cache-references ( +- 0.24% )
1,282,511,076 instructions # 2.06 insn per cycle ( +- 0.00% )
623,511,430 cycles ( +- 0.49% )
0.168184185 seconds time elapsed ( +- 0.55% )
```
The cache-miss ratios of both sse & sse_prefetch are about 7% higher than in the original versions.
> I don't know why the cache misses increased. As with phonebook, I looked at where the cache-miss hotspots are, but I can't really see anything orz
<br>
## AVX version
> First I tried not to look at other people's code and dug through the Intel Intrinsics Guide for the corresponding instructions. With a width of 256 bits, each loop iteration handles an 8*8 block of the matrix, so the original verify apparently can't be used... [time=Fri, Mar 31, 2017 5:27 PM]
> The pen-and-paper work went smoothly, but there seems to be no unpack that operates in 128-bit units... [time=Fri, Mar 31, 2017 5:27 PM]
> Not as easy as imagined, since AVX's unpack works differently from SSE2's; after some fiddling I finally pieced it together (huh?) [time=Fri, Mar 31, 2017 7:41 PM]
> Debugging the AVX version involved heavy use of gdb [time=Fri, Mar 31, 2017 8:29 PM]
> >[breakpoint](http://www.delorie.com/gnu/docs/gdb/gdb_29.html)
### Added code
```clike=
/* verify whether these two matrices are identical */
static int equal(int *a, int *b, int w, int h)
{
for (int x = 0; x < w; x++)
for (int y = 0; y < h; y++)
if (*(a + x * h + y) != *(b + x * h + y))
return -1;
return 1;
}
```
Used to compare against naive_transpose and verify that the result of avx_transpose is correct.
### AVX intrinsics used
#### __m256i _mm256_unpacklo_epi32 (__m256i a, __m256i b)
* Description
* Unpack and interleave 32-bit integers from the low half of each 128-bit lane in a and b, and store the results in dst.
> This one tripped me up badly; you can't just extrapolate from SSE2, you have to read the manual carefully.
#### __m256i _mm256_permute2x128_si256 (__m256i a, __m256i b, const int imm8)
* Description
* Shuffle 128-bits (composed of integer data) selected by imm8 from a and b, and store the results in dst.
* Adjust imm8 as needed; the manual's Operation section explains how to set it.
> Luckily I read about this intrinsic in <`yenWu`>'s notes, otherwise I would never have known to pick it.
### Results
```
avx: 64688 us
```
Speedup relative to the sse version: 2.4
```
Performance counter stats for './avx' (100 runs):
10,980,390 cache-misses # 72.052 % of all cache refs ( +- 0.44% )
15,239,534 cache-references ( +- 0.09% )
1,148,666,522 instructions # 2.07 insn per cycle ( +- 0.00% )
556,088,254 cycles ( +- 0.80% )
0.174747969 seconds time elapsed ( +- 0.74% )
```
<br>
## Analyzing the effect of prefetch distance on performance
> Want to try reproducing <`kaizsv`>'s experiment [time=Fri, Mar 31, 2017 8:23 PM]
> A quick test shows that different prefetch distances matter a lot; this chart must be made! [time=Fri, Mar 31, 2017 8:24 PM]
> One difficulty: to test PFDIST it has to become a passed-in parameter, but that changes the uniform transpose interface. What is the best way to handle this? [time=Fri, Mar 31, 2017 8:45 PM]
> In the end I crudely rewrote the prefetch interface, just to get the chart drawn first. [time=Fri, Mar 31, 2017 9:17 PM]
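One possible shape for the interface change is sketched below (my own simplified example, not the homework's transpose signature): take the distance as a runtime parameter, so sweeping PFDIST only needs a different argument per data point rather than a rebuild.

```c
#include <xmmintrin.h>

/* Copy src to dst while prefetching `pfdist` elements ahead.
 * pfdist == 0 disables prefetching entirely. */
void copy_with_prefetch(const int *src, int *dst, int n, int pfdist)
{
    for (int i = 0; i < n; i++) {
        if (pfdist > 0 && i + pfdist < n)
            _mm_prefetch((const char *) &src[i + pfdist], _MM_HINT_T1);
        dst[i] = src[i];
    }
}
```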
![](https://i.imgur.com/FmGaVbA.png)
## Reference
- [ ] [kaizsv](https://hackmd.io/s/r1IWtb9R#)
- [ ] [yenWu](https://hackmd.io/s/ryTASBCT#)
- [x] [SSE introduction](https://www.csie.ntu.edu.tw/~r89004/hive/sse/page_1.html)
- [ ] [A good NTU SIMD lecture](http://www.csie.ntu.edu.tw/~cyy/courses/assembly/10fall/lectures/handouts/lec17_x86SIMD.pdf)