2017q1 Homework3 (software-pipelining)

contributed by < heathcliffYang >

論文閱讀

背景知識

因為背景知識的不足，所以我決定先試著硬讀一遍，把不懂的都標注起來（有些不懂的在論文後面會接著說明），去查完所有不懂的東西之後，再讀一遍。

Cache

在整個 memory hierarchy 中，是相對小、快的 memory，只由硬體機制來掌控。

可以儲存最近會被使用到的資料，使對 main memory 的存取加速。
在多處理器的系統中，可以作為 shared memory，減少 system bus 與 MM traffic 的需求。
Principles of locality :
- Temporal locality : Data which have been used recently have high likelihood of being used again.
- Spatial locality : If a data is referenced, it is very likely that nearby data will be accessed soon.
Three Structures :
- Fully associative cache
- Direct mapped cache
- Set associative cache
Two types :

Block / line : 資料在 cache 裡或被搬移的最小單位

Prefetching

CPU 在真正需要用到該資料/指令之前，就先提出要求，藉由減少等待指令或資料的時間，讓 CPU 可以加速執行程式。

當資料/指令被要求時，會先被放置在 cache 裡，這會比從 memory 拿還快；此外， prefetching 可以掩蓋 memory access latency (從 request 直到 CPU 拿到資料的時間)，也就可以拉近記憶體存取速度與處理器運算速度之間的落差。
這項技術也會是分支預測的一部分。
Prefetch 的種類
- Data / instruction prefetching
  用在 data block / instruction block 。其中，相較於 instruction access，data access 比較沒有一定的規律，所以 data prefetching 的準確性較低。
- Hardware prefetching
  用額外的硬體裝置去偵測 access patterns ，之後根據偵測結果做出對應的 prefetch 要求。
- Software prefetching
  Programmer 或 compiler 基於對程式的行為的了解，在程式碼裡面插入 prefetch instrinsics。
Prefetch metrics
- Prefetch degree : 在每次做 prefetching 時用到的 cache line 的數量。
- Prefetch distance : 可以針對多遠的資料/指令 stream 做 prefetching。
Cache prefetching
利用先將預測可能會用到的 instruction / data 先從存取速度較慢的記憶體 fetch 到存取速度較快的 local memory。
- Coverage : (Cache misses eliminated due to prefetching) / (Total cache misses)
- Accuracy : (Cache misses eliminated due to prefetching) / (Useless cache prefetches + Cache misses eliminated by prefetching)
- Timeliness :
Consider a for loop where each iteration takes 12 cycles to execute and the 'prefetch' operation takes 3 cycles. This implies that for the prefetched data to be useful, we must start the prefetch 12/3=4 iterations prior to its usage to maintain timeliness.

不懂維基百科上為何會這樣算?

GHB
STR
Training effect

疑惑

Execution time -> normalization 的方法，以及 Fig. 1 的 Norm. Execution cycle 的意思
Baseline 的實作到底是只有 SW prefetch 還是只有 HW prefetch?
Section 3.2 的 Figure 4 中，others 跟 prefetch 到底是什麼關係？

心得

因為是論文的關係，所以有很多藏在文章裡的描述與細節，如果是一篇好的分析，應該條列式列出每個實驗的主要探討因素。

Prefetch 實驗

Cache_test

 Performance counter stats for './main_naive' (100 runs):

        19,098,020      cache-misses              #   96.506 % of all cache refs      ( +-  0.04% )
        19,789,367      cache-references                                              ( +-  0.13% )
     2,187,809,082      instructions              #    1.29  insns per cycle          ( +-  0.49% )
     1,690,332,635      cycles                                                        ( +-  0.41% )

       0.593995654 seconds time elapsed                                          ( +-  0.66% )

 Performance counter stats for './main_sse' (100 runs):

         6,963,624      cache-misses              #   81.878 % of all cache refs      ( +-  0.08% )
         8,504,869      cache-references                                              ( +-  0.15% )
     2,602,566,869      instructions              #    1.49  insns per cycle          ( +-  0.28% )
     1,750,195,508      cycles                                                        ( +-  0.38% )

       0.617119812 seconds time elapsed                                          ( +-  0.65% )

 Performance counter stats for './main_sse_prefetch' (100 runs):

         7,058,261      cache-misses              #   91.616 % of all cache refs      ( +-  0.08% )
         7,704,204      cache-references                                              ( +-  0.17% )
     2,013,680,676      instructions              #    1.72  insns per cycle          ( +-  0.06% )
     1,172,678,900      cycles                                                        ( +-  0.31% )

       0.420111002 seconds time elapsed                                          ( +-  0.77% )