contributed by <illusion030
>
ICC 和 GCC 裡只有簡單的 SW prefetching,所以我們必須手動加入 SW prefetching
出現兩個問題
Architecture: x86_64
CPU 作業模式: 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
每核心執行緒數: 2
每通訊端核心數: 2
Socket(s): 1
NUMA 節點: 1
供應商識別號: GenuineIntel
CPU 家族: 6
型號: 69
Model name: Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程: 1
CPU MHz: 1599.920
CPU max MHz: 2600.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.33
虛擬: VT-x
L1d 快取: 32K
L1i 快取: 32K
L2 快取: 256K
L3 快取: 3072K
NUMA node0 CPU(s): 0-3
Architecture / Microarchitecture
Microarchitecture Haswell
Platform Shark Bay
Processor core Haswell
Core stepping C0 (SR170)
Manufacturing process 0.022 micron
Data width 64 bit
The number of CPU cores 2
The number of threads 4
Floating Point Unit Integrated
Level 1 cache size 2 x 32 KB 8-way set associative instruction caches
2 x 32 KB 8-way set associative data caches
Level 2 cache size 2 x 256 KB 8-way set associative caches
Level 3 cache size 3 MB 12-way set associative shared cache
Physical memory 16 GB
Multiprocessing Uniprocessor
Features MMX instructions
SSE / Streaming SIMD Extensions
SSE2 / Streaming SIMD Extensions 2
SSE3 / Streaming SIMD Extensions 3
SSSE3 / Supplemental Streaming SIMD Extensions 3
SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4
AES / Advanced Encryption Standard instructions
AVX / Advanced Vector Extensions
AVX2 / Advanced Vector Extensions 2.0
BMI / BMI1 + BMI2 / Bit Manipulation instructions
F16C / 16-bit Floating-Point conversion instructions
FMA3 / 3-operand Fused Multiply-Add instructions
EM64T / Extended Memory 64 technology / Intel 64
NX / XD / Execute disable bit
HT / Hyper-Threading technology
TBT 2.0 / Turbo Boost technology 2.0
VT-x / Virtualization technology
Low power features Enhanced SpeedStep technology
illusion030@illusion030-X550LD:~$ sudo rdmsr -a 0x1A4
0
0
0
0
$ git clone https://github.com/illusion030/prefetcher
$ cd prefetcher
$ make
$ ./main
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 67599 us
sse: 131430 us
naive: 253113 us
void print128_num(__m128i var)
{
uint16_t *val = (uint16_t*) &var;
printf(" %i %i %i %i %i %i %i %i \n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7]);
}
0 1 2 3
4 5 6 7
8 9 10 11
12 13 14 15
After load
I0= 0 0 1 0 2 0 3 0
I1= 4 0 5 0 6 0 7 0
I2= 8 0 9 0 10 0 11 0
I3= 12 0 13 0 14 0 15 0
After first unpack
I0= 0 0 1 0 2 0 3 0
I1= 4 0 5 0 6 0 7 0
I2= 8 0 9 0 10 0 11 0
I3= 12 0 13 0 14 0 15 0
T0= 0 0 4 0 1 0 5 0
T1= 8 0 12 0 9 0 13 0
T2= 2 0 6 0 3 0 7 0
T3= 10 0 14 0 11 0 15 0
After second unpack
I0= 0 0 4 0 8 0 12 0
I1= 1 0 5 0 9 0 13 0
I2= 2 0 6 0 10 0 14 0
I3= 3 0 7 0 11 0 15 0
T0= 0 0 4 0 1 0 5 0
T1= 8 0 12 0 9 0 13 0
T2= 2 0 6 0 3 0 7 0
T3= 10 0 14 0 11 0 15 0
0 4 8 12
1 5 9 13
2 6 10 14
3 7 11 15
sse prefetch: 68962 us
sse prefetch: 75842 us
時間差了約 7000 us
sse prefetch: 128963 us
./sse_prefetch
sse_prefetch: 70545 us
struct object {
int *src;
int *dst;
int w;
int h;
func_t transpose;
};
void init_object_naive(Object **self)
{
...
(*self)->transpose = naive_transpose;
}
void init_object_sse(Object **self)
{
...
(*self)->transpose = sse_transpose;
}
void init_object_sse_prefetch(Object **self)
{
...
(*self)->transpose = sse_prefetch_transpose;
}
naive: main.c
$(CC) $(CFLAGS) -D NAIVE -o naive main.c
sse: main.c
$(CC) $(CFLAGS) -D SSE -o sse main.c
sse_prefetch: main.c
$(CC) $(CFLAGS) -D SSE_PREFETCH -o sse_prefetch main.c
Object *o = NULL;
#ifdef NAIVE
if(init_object_naive(&o) == -1) {
printf("Init object naive fail\n");
return -1;
}
#elif SSE
if(init_object_sse(&o) == -1) {
printf("Init object sse fail\n");
return -1;
}
#elif SSE_PREFETCH
if(init_object_sse_prefetch(&o) == -1) {
printf("Init object sse prefetch fail\n");
return -1;
}
#endif
typedef struct {
void (*transpose)(int *src, int *dst, int w, int h);
} trans_interface;
int naive_init(trans_interface *self)
{
self->transpose = &naive_transpose;
return 0;
}
int sse_init(trans_interface *self)
{
self->transpose = &sse_transpose;
return 0;
}
int sse_prefetch_init(trans_interface *self)
{
self->transpose = &sse_prefetch_transpose;
return 0;
}
Event Num. | Umask Value | Event Mask Mnemonic | Description |
---|---|---|---|
4CH | 01H | LOAD_HIT_PRE.SW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch. |
4CH | 02H | LOAD_HIT_PRE.HW_PF | Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch. |
D0H | 81H | MEM_UOPS_RETIRED.ALL_LOADS | All retired load uops |
D1H | 01H | MEM_LOAD_UOPS_RETIRED.L1_HIT | Retired load uops with L1 cache hits as data sources. |
D1H | 02H | MEM_LOAD_UOPS_RETIRED.L2_HIT | Retired load uops with L2 cache hits |
D1H | 04H | MEM_LOAD_UOPS_RETIRED.L3_HIT | Retired load uops with L3 cache hits as data sources. |
D1H | 08H | MEM_LOAD_UOPS_RETIRED.L1_MISS | Retired load uops missed L1 cache as data sources. |
D1H | 10H | MEM_LOAD_UOPS_RETIRED.L2_MISS | Retired load uops missed L2. Unknown data source excluded. |
D1H | 20H | MEM_LOAD_UOPS_RETIRED.L3_MISS | Retired load uops missed L3. Excludes unknown data |
Performance counter stats for './naive' (100 runs):
15,813,284 cache-misses # 97.987 % of all cache refs ( +- 0.33% ) (13.75%)
16,138,211 cache-references ( +- 0.37% ) (14.55%)
19,456,866 L1-dcache-load-misses # 3.57% of all L1-dcache hits ( +- 0.35% ) (21.86%)
545,545,436 L1-dcache-loads ( +- 0.17% ) (21.14%)
198,561,491 L1-dcache-stores ( +- 0.43% ) (19.73%)
47,086 L1-icache-load-misses ( +- 2.16% ) (13.62%)
446 r014c //LOAD_HIT_PRE.SW_PF ( +- 6.35% ) (13.54%)
324,008 r024c //LOAD_HIT_PRE.HW_PF ( +- 7.39% ) (21.03%)
509,653,725 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.45% ) (19.74%)
474,628,725 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.38% ) (18.81%)
17,779 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 9.45% ) (13.77%)
443,351 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 1.18% ) (13.68%)
15,306,768 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 0.42% ) (13.58%)
15,355,248 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 0.38% ) (13.49%)
14,982,092 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 0.53% ) (13.40%)
0.440921106 seconds time elapsed ( +- 0.31% )
Performance counter stats for './sse' (100 runs):
4,242,455 cache-misses # 88.814 % of all cache refs ( +- 0.43% ) (13.87%)
4,776,803 cache-references ( +- 0.43% ) (14.98%)
7,758,666 L1-dcache-load-misses # 1.78% of all L1-dcache hits ( +- 0.34% ) (22.39%)
435,901,161 L1-dcache-loads ( +- 0.30% ) (21.24%)
222,396,741 L1-dcache-stores ( +- 0.38% ) (19.23%)
43,121 L1-icache-load-misses ( +- 2.63% ) (13.69%)
373 r014c //LOAD_HIT_PRE.SW_PF ( +- 6.48% ) (13.78%)
295,849 r024c //LOAD_HIT_PRE.HW_PF ( +- 7.43% ) (21.43%)
398,878,132 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.61% ) (19.64%)
377,177,446 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.42% ) (18.47%)
8,856 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 12.83% ) (14.03%)
125,420 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 1.94% ) (13.92%)
3,771,031 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 0.84% ) (13.74%)
3,793,111 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 0.81% ) (13.59%)
3,678,870 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 0.83% ) (13.41%)
0.317429370 seconds time elapsed ( +- 0.31% )
Performance counter stats for './sse_prefetch' (100 runs):
3,978,276 cache-misses # 90.202 % of all cache refs ( +- 0.88% ) (13.96%)
4,410,407 cache-references ( +- 0.56% ) (15.38%)
7,640,737 L1-dcache-load-misses # 1.68% of all L1-dcache hits ( +- 0.73% ) (23.12%)
455,983,625 L1-dcache-loads ( +- 0.25% ) (21.90%)
228,291,008 L1-dcache-stores ( +- 0.27% ) (19.32%)
41,051 L1-icache-load-misses ( +- 2.26% ) (13.53%)
738,067 r014c //LOAD_HIT_PRE.SW_PF ( +- 3.08% ) (13.94%)
263,426 r024c //LOAD_HIT_PRE.HW_PF ( +- 8.58% ) (22.27%)
409,935,362 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.68% ) (19.97%)
405,300,498 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.54% ) (18.03%)
3,266,225 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 1.14% ) (14.18%)
20,790 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 5.32% ) (14.08%)
3,425,788 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 1.46% ) (13.88%)
47,714 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 7.27% ) (13.61%)
24,762 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 6.40% ) (13.38%)
0.260656173 seconds time elapsed ( +- 0.90% )
void print256_num(__m256i var)
{
uint16_t *val = (uint16_t*) &var;
printf(" %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d\n",
val[0], val[1], val[2], val[3], val[4], val[5],
val[6], val[7], val[8], val[9], val[10], val[11],
val[12], val[13], val[14], val[15]);
}
CFLAGS = -msse2 --std gnu99 -O0 -Wall -Wextra -mavx2
illusion030@illusion030-X550LD:~/Desktop/2017sysprog/prefetcher$ make run
./naive
naive: 289409 us
./sse
sse: 127841 us
./sse_prefetch
sse_prefetch: 70374 us
./avx
avx: 66426 us
Performance counter stats for './avx' (100 runs):
3,092,766 cache-misses # 86.521 % of all cache refs ( +- 0.54% ) (14.04%)
3,574,569 cache-references ( +- 0.63% ) (15.41%)
7,226,514 L1-dcache-load-misses # 1.86% of all L1-dcache hits ( +- 0.75% ) (23.17%)
388,074,299 L1-dcache-loads ( +- 0.23% ) (21.88%)
201,357,224 L1-dcache-stores ( +- 0.33% ) (19.41%)
45,747 L1-icache-load-misses ( +- 2.91% ) (13.78%)
581 r014c //LOAD_HIT_PRE.SW_PF ( +- 10.65% ) (13.95%)
326,686 r024c //LOAD_HIT_PRE.HW_PF ( +- 7.99% ) (21.75%)
357,270,167 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.79% ) (19.44%)
343,995,831 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.58% ) (17.84%)
13,222 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 12.09% ) (14.20%)
80,787 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 2.55% ) (14.01%)
951,656 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 1.21% ) (13.86%)
944,359 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 1.28% ) (13.62%)
879,774 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 1.30% ) (13.46%)
0.259288102 seconds time elapsed ( +- 0.38% )
_mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1);
./naive
naive: 244264 us
./sse
sse: 130113 us
./sse_prefetch
sse_prefetch: 74131 us
./avx
avx: 65895 us
./avx_prefetch
avx_prefetch: 67076 us
Performance counter stats for './avx_prefetch' (100 runs):
3,071,554 cache-misses # 61.280 % of all cache refs ( +- 0.52% ) (14.09%)
5,012,325 cache-references ( +- 0.47% ) (15.50%)
6,729,740 L1-dcache-load-misses # 1.68% of all L1-dcache hits ( +- 0.55% ) (23.25%)
400,788,741 L1-dcache-loads ( +- 0.18% ) (22.03%)
200,235,391 L1-dcache-stores ( +- 0.25% ) (19.95%)
64,761 L1-icache-load-misses ( +- 2.66% ) (14.43%)
191,524 r014c //LOAD_HIT_PRE.SW_PF ( +- 3.10% ) (13.55%)
550,186 r024c //LOAD_HIT_PRE.HW_PF ( +- 4.88% ) (20.59%)
385,091,102 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.73% ) (18.18%)
358,884,415 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.74% ) (17.42%)
32,042 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 1.97% ) (14.16%)
553,243 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 1.15% ) (14.16%)
1,182,803 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 1.19% ) (13.92%)
1,194,809 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 0.93% ) (13.64%)
627,029 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 0.59% ) (13.38%)
0.261470543 seconds time elapsed ( +- 0.21% )
illusion030@illusion030-X550LD:~/Desktop/memory_latency$ sudo modprobe msr
illusion030@illusion030-X550LD:~/Desktop/memory_latency$ ./mlc
Intel(R) Memory Latency Checker - v3.1a
tried to open /dev/cpu/0/msr
Cannot access MSR module. Possibilities include not having root privileges, or MSR module not inserted (try 'modprobe msr', and refer README)
root@illusion030-X550LD:/home/illusion030/Desktop/memory_latency# ./mlc
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
Memory node
Socket 0
0 62.3
Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads : 19368.4
3:1 Reads-Writes : 16337.4
2:1 Reads-Writes : 16425.5
1:1 Reads-Writes : 18360.1
Stream-triad like: 17216.6
Measuring Memory Bandwidths between nodes within system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Memory node
Socket 0
0 19441.2
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 74.88 17074.9
00002 74.87 17084.0
00008 74.79 17060.9
00015 74.34 16855.8
00050 66.22 9211.8
00100 65.24 6029.9
00200 63.82 3733.0
00300 64.71 2855.0
00400 63.50 2424.7
00500 63.97 2143.7
00700 63.72 1828.5
01000 63.42 1589.5
01300 63.21 1460.4
01700 63.10 1357.4
02500 63.07 1248.7
03500 63.06 1182.3
05000 62.99 1133.5
09000 62.96 1081.9
20000 62.99 1045.7
Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT latency 30.4
Local Socket L2->L2 HITM latency 35.7
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject Latency Bandwidth
Delay (ns) MB/sec
==========================
00000 86.82 15573.2
00002 85.72 15515.8
00008 85.22 15528.7
00015 86.11 15460.6
00050 75.05 8977.5
00100 73.72 5871.2
00200 71.39 3587.3
00300 72.28 2730.4
00400 71.15 2291.8
00500 70.06 2041.5
00700 70.79 1712.6
01000 70.38 1474.3
01300 70.33 1347.3
01700 69.51 1259.1
02500 69.82 1146.5
03500 69.62 1083.9
05000 69.46 1037.1
09000 69.98 978.6
20000 69.15 954.5
loop time = 65344 us
avx: 65358 us
naive: 247381 us
sse: 128336 us
sse_prefetch: 71009 us
avx: 66292 us
avx_prefetch: 64742 us
Performance counter stats for './avx_prefetch' (100 runs):
3,162,625 cache-misses # 79.583 % of all cache refs ( +- 0.44% ) (14.07%)
3,973,992 cache-references ( +- 0.47% ) (15.46%)
7,050,577 L1-dcache-load-misses # 1.75% of all L1-dcache hits ( +- 0.53% ) (23.21%)
403,421,197 L1-dcache-loads ( +- 0.14% ) (21.87%)
202,331,106 L1-dcache-stores ( +- 0.17% ) (19.32%)
57,234 L1-icache-load-misses ( +- 2.94% ) (13.47%)
344,731 r014c //LOAD_HIT_PRE.SW_PF ( +- 2.17% ) (13.63%)
262,287 r024c //LOAD_HIT_PRE.HW_PF ( +- 8.04% ) (22.16%)
353,934,197 r81d0 //MEM_UOPS_RETIRED.ALL_LOADS ( +- 0.68% ) (19.86%)
349,566,969 r01d1 //MEM_LOAD_UOPS_RETIRED.L1_HIT ( +- 0.42% ) (18.09%)
780,509 r02d1 //MEM_LOAD_UOPS_RETIRED.L2_HIT ( +- 0.47% ) (14.26%)
50,093 r04d1 //MEM_LOAD_UOPS_RETIRED.L3_HIT ( +- 2.24% ) (14.12%)
1,227,596 r08d1 //MEM_LOAD_UOPS_RETIRED.L1_MISS ( +- 0.72% ) (13.87%)
452,265 r10d1 //MEM_LOAD_UOPS_RETIRED.L2_MISS ( +- 0.87% ) (13.66%)
400,909 r20d1 //MEM_LOAD_UOPS_RETIRED.L3_MISS ( +- 0.91% ) (13.44%)
0.256960590 seconds time elapsed ( +- 0.18% )
cache hit 的部份有比較跟預期的一樣, L2 LOAD HIT 有增加
因為 l / s 實在太小了只有大約 0.3,沒辦法取到一個很好的 PFDIST,所以 AVX+prefetch 在這個程式上沒有很好的表現