# 2017q1 Homework3 (software-pipelining)
contributed by <`zhanyangch`>
### Reviewed by `ZixinYang`
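- This write-up finds that sse and sse-prefetch produce nearly the same number of cache-misses. It then applies the paper's point that the prefetch distance must be greater than the memory latency and observes how raising PFDIST affects cache-misses, which highlights the performance cost of prefetching too early.
- Raw counters are used at the end to obtain more detailed data. The author could explain why these particular events were chosen for testing and interpret the resulting numbers.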
###### tags:`zhanyangch` `sysprog2017` `week3`
## Paper reading
* [Notes](https://hackmd.io/s/B1KOuuWog)
## Execution environment
```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel(R) Core(TM) i7-2640M CPU @ 2.80GHz
Stepping: 7
CPU MHz: 855.421
CPU max MHz: 3500.0000
CPU min MHz: 800.0000
BogoMIPS: 5587.06
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 4096K
NUMA node0 CPU(s): 0-3
```
## Transpose Matrix
### Unified interface
Modify the Makefile and impl.c so that the `-D` macro passed at compile time decides which implementation is built.
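
A minimal sketch of how such a `-D` switch might look inside impl.c. The macro name `SSE`, the single entry point `transpose`, and its parameter order are assumptions made for illustration; the original does not show the actual names.

```c
/* Hypothetical macro name: build with -DSSE to select the SSE variant. */
#if defined(SSE)
#include <emmintrin.h>

/* 4x4-block SSE2 variant; w and h are assumed to be multiples of 4. */
void transpose(const int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            __m128i I0 = _mm_loadu_si128((const __m128i *) (src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((const __m128i *) (src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((const __m128i *) (src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((const __m128i *) (src + (y + 3) * w + x));
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            _mm_storeu_si128((__m128i *) (dst + (x + 0) * h + y), _mm_unpacklo_epi64(T0, T1));
            _mm_storeu_si128((__m128i *) (dst + (x + 1) * h + y), _mm_unpackhi_epi64(T0, T1));
            _mm_storeu_si128((__m128i *) (dst + (x + 2) * h + y), _mm_unpacklo_epi64(T2, T3));
            _mm_storeu_si128((__m128i *) (dst + (x + 3) * h + y), _mm_unpackhi_epi64(T2, T3));
        }
    }
}
#else
/* Default (naive) variant: element-by-element copy, any w and h. */
void transpose(const int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x++)
        for (int y = 0; y < h; y++)
            dst[x * h + y] = src[y * w + x];
}
#endif
```

With this layout the caller always uses `transpose(src, dst, w, h)`, and a Makefile rule only needs to vary the flag, for example `gcc -DSSE -o sse_transpose main.c` (the file and target names here are likewise illustrative).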
### SSE transpose
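* `__m128i _mm_loadu_si128(__m128i const *p)`: loads 128 bits (four 32-bit integers) from `p`, which does not need to be aligned
* `__m128i _mm_unpacklo_epi32(__m128i a, __m128i b)`: interleaves the two lowest 32-bit elements of `a` and `b` as [a0 b0 a1 b1]
For example, `__m128i T0 = _mm_unpacklo_epi32(I0, I1);`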
|__m128i|d0|d1|d2|d3|
|:----:|:---:|:---:|:---:|:---:|
|**I0**| i00 | i01 | i02 | i03 |
|**I1**| i10 | i11 | i12 | i13 |
|**T0**| i00 | i10 | i01 | i11 |
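* `__m128i _mm_unpackhi_epi32(__m128i a, __m128i b)`: interleaves the two highest 32-bit elements of `a` and `b` as [a2 b2 a3 b3]
* `void _mm_storeu_si128(__m128i *p, __m128i a)`: stores the 128 bits of `a` to `p`, which does not need to be aligned
* `void _mm_prefetch(char const *p, int i)`: prefetches the cache line containing address `p`; each call brings in one cache line, and `i` is the locality hint
(`_MM_HINT_T0`, `_MM_HINT_T1`, `_MM_HINT_T2`, `_MM_HINT_NTA`)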
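
The interleaving shown in the table can be checked with a small helper pair. This is only an illustrative sketch: the function names `unpack_lo` and `unpack_hi` are made up here, and the inputs stand for the rows I0 and I1.

```c
#include <emmintrin.h>

/* out = [a0 b0 a1 b1]: interleave the low two 32-bit elements of a and b. */
void unpack_lo(const int a[4], const int b[4], int out[4])
{
    __m128i I0 = _mm_loadu_si128((const __m128i *) a);
    __m128i I1 = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_unpacklo_epi32(I0, I1));
}

/* out = [a2 b2 a3 b3]: interleave the high two 32-bit elements of a and b. */
void unpack_hi(const int a[4], const int b[4], int out[4])
{
    __m128i I0 = _mm_loadu_si128((const __m128i *) a);
    __m128i I1 = _mm_loadu_si128((const __m128i *) b);
    _mm_storeu_si128((__m128i *) out, _mm_unpackhi_epi32(I0, I1));
}
```

Calling `unpack_lo` with `{0, 1, 2, 3}` and `{10, 11, 12, 13}` yields `{0, 10, 1, 11}`, which is the T0 row of the table; `unpack_hi` on the same inputs yields `{2, 12, 3, 13}`.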
## Observing cache misses with perf
```
Performance counter stats for './naive_transpose' (5 runs):
18,321,550 cache-misses # 91.193 % of all cache refs ( +- 0.03% )
20,090,955 cache-references ( +- 0.01% )
1,448,992,488 instructions # 1.04 insns per cycle ( +- 0.01% )
1,388,146,906 cycles ( +- 1.86% )
0.476780726 seconds time elapsed ( +- 3.28% )
Performance counter stats for './sse_transpose' (5 runs):
6,044,604 cache-misses # 80.660 % of all cache refs ( +- 0.14% )
7,493,943 cache-references ( +- 0.06% )
1,237,151,919 instructions # 1.33 insns per cycle ( +- 0.02% )
933,002,791 cycles ( +- 1.95% )
0.349286181 seconds time elapsed ( +- 8.65% )
Performance counter stats for './sse_prefetch_transpose' (5 runs):
6,005,813 cache-misses # 80.029 % of all cache refs ( +- 0.10% )
7,504,588 cache-references ( +- 0.12% )
1,283,017,647 instructions # 1.73 insns per cycle ( +- 0.00% )
741,071,615 cycles ( +- 0.37% )
0.293647190 seconds time elapsed ( +- 6.50% )
```
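
The derived columns in the perf output (miss percentage and instructions per cycle) are plain ratios of the raw counts. As a sanity check, they can be recomputed with two helpers; the function names are hypothetical and not part of perf.

```c
/* Cache-miss percentage: misses as a share of all cache references. */
double miss_percent(unsigned long long misses, unsigned long long refs)
{
    return refs ? 100.0 * (double) misses / (double) refs : 0.0;
}

/* Instructions per cycle. */
double insns_per_cycle(unsigned long long insns, unsigned long long cycles)
{
    return cycles ? (double) insns / (double) cycles : 0.0;
}
```

For the naive run, `miss_percent(18321550, 20090955)` is about 91.19 and `insns_per_cycle(1448992488, 1388146906)` is about 1.04, matching the figures perf prints.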
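* Surprisingly, with PFDIST set to 8, sse and sse+prefetch have nearly the same number of cache misses. According to the paper, the prefetch distance must be greater than the memory latency.
* Varying PFDIST shows two turning points, at PFDIST = 16 and PFDIST = 112; elsewhere the results differ little. When PFDIST < 16, the prefetch distance is shorter than the memory latency, so the data has not arrived by the time it is needed. When PFDIST > 112, the data is prefetched too early, and cache misses rise again.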
![](https://i.imgur.com/ouELPiN.png)
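
To make the role of PFDIST concrete, here is a sketch of a 4x4-block SSE transpose that issues prefetches PFDIST rows ahead of the rows it is about to load. It is an assumption-laden illustration rather than the code that was measured: the `_MM_HINT_T1` hint, the bounds check, and the parameter order are choices made here.

```c
#include <emmintrin.h> /* SSE2 integer intrinsics */
#include <xmmintrin.h> /* _mm_prefetch and the _MM_HINT_* constants */

#ifndef PFDIST
#define PFDIST 8 /* prefetch distance in rows; override with -DPFDIST=n */
#endif

/* Transposes a w-wide, h-tall int matrix; w and h must be multiples of 4. */
void sse_prefetch_transpose(const int *src, int *dst, int w, int h)
{
    for (int x = 0; x < w; x += 4) {
        for (int y = 0; y < h; y += 4) {
            /* Ask for the four rows that will be loaded PFDIST rows later.
             * The bounds check (an addition in this sketch) avoids forming
             * pointers past the end of src. */
            if (y + PFDIST + 3 < h) {
                for (int r = 0; r < 4; r++)
                    _mm_prefetch((const char *) (src + (y + PFDIST + r) * w + x),
                                 _MM_HINT_T1);
            }
            __m128i I0 = _mm_loadu_si128((const __m128i *) (src + (y + 0) * w + x));
            __m128i I1 = _mm_loadu_si128((const __m128i *) (src + (y + 1) * w + x));
            __m128i I2 = _mm_loadu_si128((const __m128i *) (src + (y + 2) * w + x));
            __m128i I3 = _mm_loadu_si128((const __m128i *) (src + (y + 3) * w + x));
            /* Interleave 32-bit, then 64-bit halves to form the columns. */
            __m128i T0 = _mm_unpacklo_epi32(I0, I1);
            __m128i T1 = _mm_unpacklo_epi32(I2, I3);
            __m128i T2 = _mm_unpackhi_epi32(I0, I1);
            __m128i T3 = _mm_unpackhi_epi32(I2, I3);
            _mm_storeu_si128((__m128i *) (dst + (x + 0) * h + y), _mm_unpacklo_epi64(T0, T1));
            _mm_storeu_si128((__m128i *) (dst + (x + 1) * h + y), _mm_unpackhi_epi64(T0, T1));
            _mm_storeu_si128((__m128i *) (dst + (x + 2) * h + y), _mm_unpacklo_epi64(T2, T3));
            _mm_storeu_si128((__m128i *) (dst + (x + 3) * h + y), _mm_unpackhi_epi64(T2, T3));
        }
    }
}
```

Because PFDIST is a compile-time macro in this sketch, sweeping it means rebuilding once per value (for example with `-DPFDIST=16`) and rerunning perf on each binary.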
* Raw counters give more detailed data. Following [illusion030's notes](https://hackmd.io/s/HkHDV-moe#raw-counter), the PERFORMANCE-MONITORING EVENTS for this processor model can be found in the [Intel® 64 and IA-32 Architectures Developer's Manual: Vol. 3B](https://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-software-developer-vol-3b-part-2-manual.html). The entries listed below are taken from Table 19.5.
|Event Num.|Umask Value|Event Mask Mnemonic|
|---|---|---|
|4CH|01H|LOAD_HIT_PRE.SW_PF|
|4CH|02H|LOAD_HIT_PRE.HW_PF|
|D1H|01H|MEM_LOAD_UOPS_RETIRED.L1_HIT|
|D1H|02H|MEM_LOAD_UOPS_RETIRED.L2_HIT|
|D1H|04H|MEM_LOAD_UOPS_RETIRED.LLC_HIT|
|D1H|08H|MEM_LOAD_UOPS_RETIRED.L1_MIS|
|D1H|10H|MEM_LOAD_UOPS_RETIRED.L2_MIS|
|D1H|20H|MEM_LOAD_UOPS_RETIRED.LLC_MIS|
Command format
```
perf stat -e r<Umask Value><Event Num>
```
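
The raw event string is the letter `r` followed by the umask and then the event number, both as two hex digits. A small helper can build it from the table values; the name `perf_raw_event` is hypothetical and only illustrates the encoding.

```c
#include <stdio.h>

/* Writes "r<umask><event>" as two-digit lowercase hex into buf.
 * Returns the value of snprintf (length of the formatted string). */
int perf_raw_event(char *buf, size_t len, unsigned event, unsigned umask)
{
    return snprintf(buf, len, "r%02x%02x", umask & 0xffu, event & 0xffu);
}
```

For example, event `4CH` with umask `01H` (LOAD_HIT_PRE.SW_PF) gives `r014c`, and event `D1H` with umask `20H` gives `r20d1`, which are the names that appear in the perf output.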
```
Performance counter stats for './naive_transpose' (10 runs):
4,380 r014c LOAD_HIT_PRE.SW_PF ( +- 20.20% ) (24.90%)
416,936 r024c LOAD_HIT_PRE.HW_PF ( +- 27.46% ) (25.69%)
521,970,614 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.87% ) (26.13%)
55,233 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 58.72% ) (25.76%)
457,605 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 17.80% ) (25.39%)
16,221,927 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS ( +- 0.90% ) (25.18%)
16,138,041 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS ( +- 1.07% ) (25.06%)
3,110 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS ( +- 34.97% ) (24.78%)
0.485324876 seconds time elapsed ( +- 2.56% )
Performance counter stats for './sse_transpose' (10 runs):
11,687 r014c LOAD_HIT_PRE.SW_PF ( +- 67.66% ) (24.64%)
613,046 r024c LOAD_HIT_PRE.HW_PF ( +- 27.57% ) (25.75%)
427,907,439 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 2.01% ) (26.56%)
32,728 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 53.98% ) (26.24%)
105,153 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 16.43% ) (25.64%)
3,968,212 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS ( +- 3.31% ) (25.45%)
4,018,123 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS ( +- 3.23% ) (25.23%)
2,194 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS ( +- 43.05% ) (24.75%)
0.314186726 seconds time elapsed ( +- 4.80% )
Performance counter stats for './sse_prefetch_transpose' (10 runs):
73,962 r014c LOAD_HIT_PRE.SW_PF ( +- 7.85% ) (24.98%)
645,530 r024c LOAD_HIT_PRE.HW_PF ( +- 27.46% ) (25.76%)
436,823,098 r01d1 MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 1.01% ) (26.42%)
250,519 r02d1 MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 4.97% ) (25.91%)
3,701,145 r04d1 MEM_LOAD_UOPS_RETIRED.LLC_HIT ( +- 1.58% ) (25.79%)
3,950,488 r08d1 MEM_LOAD_UOPS_RETIRED.L1_MIS ( +- 1.36% ) (25.68%)
3,771,673 r10d1 MEM_LOAD_UOPS_RETIRED.L2_MIS ( +- 1.11% ) (25.31%)
1,026 r20d1 MEM_LOAD_UOPS_RETIRED.LLC_MIS ( +- 10.91% ) (24.86%)
0.298254699 seconds time elapsed ( +- 3.08% )
```
The LOAD_HIT_PRE.SW_PF count for sse_prefetch (73,962) is much higher than for the other two versions (4,380 for naive and 11,687 for sse).