2016q3 Homework3 (mergesort-concurrent)

# 2016q3 Homework3 (mergesort-concurrent) contributed by <`f5120125`> ###### tags: `sysprog` ```clike= $ ./main 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 200603 us sse: 234827 us naive: 378814 us ``` - 使用Perf分析cache miss ```clike= $ perf stat -r 50 -e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses ./main Performance counter stats for './main' (50 runs): 27,985,354 cache-misses # 84.980 % of all cache refs (33.29%) 31,806,730 cache-references (50.24%) 34,757,822 L1-dcache-load-misses # 4.34% of all L1-dcache hits (66.90%) 806,498,136 L1-dcache-loads (66.16%) 352,545,104 L1-dcache-stores (64.64%) 377,450 L1-icache-load-misses (33.11%) 0.731182840 seconds time elapsed ( +- 3.65% ) ``` ### 更改 Makefile ```clike= $ make cache-test perf stat -e cache-misses,cache-references,instructions,cycles ./naive_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 naive: 322006 us Performance counter stats for './naive_transpose': 18,959,975 cache-misses # 93.414 % of all cache refs 20,296,767 cache-references 1,572,649,544 instructions # 0.89 insns per cycle 1,761,874,413 cycles 0.511842859 seconds time elapsed perf stat -e cache-misses,cache-references,instructions,cycles ./sse_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse: 160067 us Performance counter stats for './sse_transpose': 6,657,239 cache-misses # 84.130 % of all cache refs 7,913,069 cache-references 1,421,957,448 instructions # 1.13 insns per cycle 1,253,144,039 cycles 0.333203789 seconds time elapsed perf stat -e cache-misses,cache-references,instructions,cycles ./sse_prefetch_transpose 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 0 4 8 12 1 5 9 13 2 6 10 14 3 7 11 15 sse prefetch: 86066 us Performance counter stats for './sse_prefetch_transpose': 6,592,423 cache-misses # 83.190 % of all cache refs 7,924,577 cache-references 1,407,620,222 instructions # 1.54 insns per cycle 914,875,974 cycles 0.283419155 seconds time elapsed ``` ## 效能差異 ### 矩陣轉置 #### TRANSPOSE_IMPL 記憶體空間是 Row Major,所以在讀取列時是讀取連續的記憶體空間,寫入行則是寫入不連續的記憶體空間,所以矩陣轉置造成很高比例的 cache-misses. ### naive_transpose 最簡單的矩陣轉置想法最直覺的矩陣轉置，從矩陣的左上角第一個元素開始，計算每一個元素轉置之後的目的地，並且存入成為新矩陣。 ```clike= void naive_transpose(int *src, int *dst, int w, int h) { for(int x = 0; x < w; x++){ for(int y = 0; y < h; y++){ *(dst + x*h + y) = *(src + y*w + x); } } } ``` ### sse_transpose 使用了 Intel 處理器 [SIMD](https://en.wikipedia.org/wiki/SIMD) 的技術，每個 int element 為 32-bit，所以 4x4 矩陣的每一個 row 剛好可以打包成 128-bit，再透過高低位的 swap，達到轉置的效相較之下`naive transpose` 執行次數多且需要做 jump，增加 cache-misses 機會，`sse transpose` 更快。 #### 效能改進 * 一次將 4 筆數據放入 sse 暫存器中，一條指令處理 4 筆數據，比 4 筆數據 4 條指令處理快 * loop unrolling： * 執行 loop 循環的組合語言代碼執行次數會變少 * branch prediction miss 機率降低 > [Programming trivia: 4x4 integer matrix transpose in SSE2](https://www.randombit.net/bitbashing/2009/10/08/integer_matrix_transpose_in_sse2.html) ```clike= void sse_transpose(int *src, int *dst, int w, int h) { for (int x = 0; x < w; x += 4) { for (int y = 0; y < h; y += 4) { __m128i I0 = _mm_loadu_si128((__m128i *)(src + (y + 0) * w + x)); __m128i I1 = _mm_loadu_si128((__m128i *)(src + (y + 1) * w + x)); __m128i I2 = _mm_loadu_si128((__m128i *)(src + (y + 2) * w + x)); __m128i I3 = _mm_loadu_si128((__m128i *)(src + (y + 3) * w + x)); __m128i T0 = _mm_unpacklo_epi32(I0, I1); __m128i T1 = _mm_unpacklo_epi32(I2, I3); __m128i T2 = _mm_unpackhi_epi32(I0, I1); __m128i T3 = _mm_unpackhi_epi32(I2, I3); I0 = _mm_unpacklo_epi64(T0, T1); I1 = _mm_unpackhi_epi64(T0, T1); I2 = _mm_unpacklo_epi64(T2, T3); I3 = _mm_unpackhi_epi64(T2, T3); _mm_storeu_si128((__m128i *)(dst + ((x + 0) * h) + y), I0); _mm_storeu_si128((__m128i *)(dst + ((x + 1) * h) + y), I1); _mm_storeu_si128((__m128i *)(dst + ((x + 2) * h) + y), I2); _mm_storeu_si128((__m128i *)(dst + ((x + 3) * h) + y), I3); } } } ```

Syntax	Example	Reference
# Header	Header	基本排版
- Unordered List	Unordered List
1. Ordered List	Ordered List
- [ ] Todo List	Todo List
> Blockquote	Blockquote
Bold font	Bold font
Italics font	Italics font
~~Strikethrough~~	~~Strikethrough~~
19^th^	19^th
H~2~O	H₂O
++Inserted text++	Inserted text
==Marked text==	Marked text
[link text](https:// "title")	Link
![image alt](https:// "title")	Image
`Code`	`Code`	在筆記中貼入程式碼
```javascript var i = 0; ```	`var i = 0;`
:smile:		Emoji list
{%youtube youtube_id %}	Externals
$L^aT_eX$	L^aT_eX
:::info This is a alert area. :::	This is a alert area.