owned this note
owned this note
Published
Linked with GitHub
# 2016q3 Homework3 (mergesort-concurrent)
contributed by <`f5120125`>
###### tags: `sysprog`
```clike=
$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 200603 us
sse: 234827 us
naive: 378814 us
```
- 使用Perf分析cache miss
```clike=
$ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main
Performance counter stats for './main' (50 runs):
27,985,354 cache-misses # 84.980 % of all cache refs (33.29%)
31,806,730 cache-references (50.24%)
34,757,822 L1-dcache-load-misses # 4.34% of all L1-dcache hits (66.90%)
806,498,136 L1-dcache-loads (66.16%)
352,545,104 L1-dcache-stores (64.64%)
377,450 L1-icache-load-misses (33.11%)
0.731182840 seconds time elapsed ( +- 3.65% )
```
### 更改 Makefile
```clike=
$ make cache-test
perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
naive: 322006 us
Performance counter stats for './naive_transpose':
18,959,975 cache-misses # 93.414 % of all cache refs
20,296,767 cache-references
1,572,649,544 instructions # 0.89 insns per cycle
1,761,874,413 cycles
0.511842859 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse: 160067 us
Performance counter stats for './sse_transpose':
6,657,239 cache-misses # 84.130 % of all cache refs
7,913,069 cache-references
1,421,957,448 instructions # 1.13 insns per cycle
1,253,144,039 cycles
0.333203789 seconds time elapsed
perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 86066 us
Performance counter stats for './sse_prefetch_transpose':
6,592,423 cache-misses # 83.190 % of all cache refs
7,924,577 cache-references
1,407,620,222 instructions # 1.54 insns per cycle
914,875,974 cycles
0.283419155 seconds time elapsed
```
## 效能差異
### 矩陣轉置
#### TRANSPOSE_IMPL
記憶體空間是 Row Major,所以在讀取列時是讀取連續的記憶體空間,寫入行則是寫入不連續的記憶體空間,所以矩陣轉置造成很高比例的 cache-misses.
### naive_transpose
最簡單的矩陣轉置想法最直覺的矩陣轉置,從矩陣的左上角第一個元素開始,計算每一個元素轉置之後的目的地,並且存入成為新矩陣。
```clike=
void naive_transpose(int *src, int *dst, int w, int h)
{
for(int x = 0; x < w; x++){
for(int y = 0; y < h; y++){
*(dst + x*h + y) = *(src + y*w + x);
}
}
}
```
### sse_transpose
使用了 Intel 處理器 [SIMD](https://en.wikipedia.org/wiki/SIMD) 的技術,每個 int element 為 32-bit,所以 4x4 矩陣的每一個 row 剛好可以打包成 128-bit,再透過高低位的 swap,達到轉置的效
相較之下`naive transpose` 執行次數多且需要做 jump,增加 cache-misses 機會,`sse transpose` 更快。
#### 效能改進
* 一次將 4 筆數據放入 sse 暫存器中,一條指令處理 4 筆數據,比 4 筆數據 4 條指令處理快
* loop unrolling:
* 執行 loop 循環的組合語言代碼執行次數會變少
* branch prediction miss 機率降低
> [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html)
```clike=
void sse_transpose(int *src, int *dst, int w, int h)
{
for (int x = 0; x < w; x += 4) {
for (int y = 0; y < h; y += 4) {
__m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x));
__m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x));
__m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x));
__m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x));
__m128i T0 = _mm_unpacklo_epi32(I0, I1);
__m128i T1 = _mm_unpacklo_epi32(I2, I3);
__m128i T2 = _mm_unpackhi_epi32(I0, I1);
__m128i T3 = _mm_unpackhi_epi32(I2, I3);
I0 = _mm_unpacklo_epi64(T0, T1);
I1 = _mm_unpackhi_epi64(T0, T1);
I2 = _mm_unpacklo_epi64(T2, T3);
I3 = _mm_unpackhi_epi64(T2, T3);
_mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0);
_mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1);
_mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2);
_mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3);
}
}
}
```