Try   HackMD

2017q1 Homework3 (software-pipelining)

contributed by <illusion030>


When Prefetching Works, When It Doesn’t, and Why

  • 閱讀 When Prefetching Works, When It Doesn’t, and Why

  • ICC 和 GCC 裡只有簡單的 SW prefetching,所以我們必須手動加入 SW prefetching

  • 出現兩個問題

    • 對於如何最佳的使用 prefetching 的文件很少
    • 對於 SW prefetching 和 HW prefetching 之間互動的複雜度並沒有很了解

開發環境

Architecture:          x86_64
CPU 作業模式:    	32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數:		2
每通訊端核心數:		2
Socket(s):             1
NUMA 節點:         	1
供應商識別號:  		GenuineIntel
CPU 家族:          	6
型號:              	69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程:              	1
CPU MHz:             	1599.920
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.33
虛擬:              	VT-x
L1d 快取:          	32K
L1i 快取:          	32K
L2 快取:           	256K
L3 快取:           	3072K
NUMA node0 CPU(s):     0-3
Architecture / Microarchitecture
Microarchitecture        Haswell
Platform		  Shark Bay
Processor core   	  Haswell
Core stepping   	  C0 (SR170)
Manufacturing process	  0.022 micron
Data width		  64 bit
The number of CPU cores  2
The number of threads	  4
Floating Point Unit	  Integrated
Level 1 cache size  	  2 x 32 KB 8-way set associative instruction caches 
			  2 x 32 KB 8-way set associative data caches
Level 2 cache size  	  2 x 256 KB 8-way set associative caches
Level 3 cache size	  3 MB 12-way set associative shared cache
Physical memory		  16 GB
Multiprocessing		  Uniprocessor
Features		  MMX instructions
			  SSE / Streaming SIMD Extensions
			  SSE2 / Streaming SIMD Extensions 2
			  SSE3 / Streaming SIMD Extensions 3
			  SSSE3 / Supplemental Streaming SIMD Extensions 3
			  SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4  
			  AES / Advanced Encryption Standard instructions
			  AVX / Advanced Vector Extensions
			  AVX2 / Advanced Vector Extensions 2.0
			  BMI / BMI1 + BMI2 / Bit Manipulation instructions
			  F16C / 16-bit Floating-Point conversion instructions
			  FMA3 / 3-operand Fused Multiply-Add instructions
			  EM64T / Extended Memory 64 technology / Intel 64  
			  NX / XD / Execute disable bit  
			  HT / Hyper-Threading technology   
			  TBT 2.0 / Turbo Boost technology 2.0   
			  VT-x / Virtualization technology 
Low power features	  Enhanced SpeedStep technology  
  • Microarchitecture 為 Haswell
illusion030@illusion030-X550LD:~$ sudo rdmsr -a 0x1A4
0
0
0
0
  • 4個 processor 的 4個 HW prefetcher 都是 enable 的狀態

開發紀錄

$ git clone https://github.com/illusion030/prefetcher
$ cd prefetcher
$ make
$ ./main
  • 執行結果
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse prefetch: 	 67599 us
sse: 		 131430 us
naive: 		 253113 us
  • 看一下程式碼,在 main.c 中前半段為測試結果是否正確,後半段測出 sse prefetch、sse、naive 個別的時間

SSE

void print128_num(__m128i var)
{
    uint16_t *val = (uint16_t*) &var;
    printf(" %i %i %i %i %i %i %i %i \n",
           val[0], val[1], val[2], val[3], val[4], val[5],
           val[6], val[7]);
}
  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

After load
I0= 0 0 1 0 2 0 3 0 
I1= 4 0 5 0 6 0 7 0 
I2= 8 0 9 0 10 0 11 0 
I3= 12 0 13 0 14 0 15 0 
After first unpack
I0= 0 0 1 0 2 0 3 0 
I1= 4 0 5 0 6 0 7 0 
I2= 8 0 9 0 10 0 11 0 
I3= 12 0 13 0 14 0 15 0 
T0= 0 0 4 0 1 0 5 0 
T1= 8 0 12 0 9 0 13 0 
T2= 2 0 6 0 3 0 7 0 
T3= 10 0 14 0 11 0 15 0 
After second unpack
I0= 0 0 4 0 8 0 12 0 
I1= 1 0 5 0 9 0 13 0 
I2= 2 0 6 0 10 0 14 0 
I3= 3 0 7 0 11 0 15 0 
T0= 0 0 4 0 1 0 5 0 
T1= 8 0 12 0 9 0 13 0 
T2= 2 0 6 0 3 0 7 0 
T3= 10 0 14 0 11 0 15 0 
  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
  • _mm_unpacklo_epi32 是將兩個參數的 lower bits 以 32 bits 為單位輪流排序並 return
  • _mm_unpackhi_epi32 是將兩個參數的 higher bits 以 32 bits 為單位輪流排序並 return
  • _mm_unpacklo_epi64 是將兩個參數的 lower bits 以 64 bits 為單位輪流排序並 return
  • _mm_unpackhi_epi64 是將兩個參數的 higher bits 以 64 bits 為單位輪流排序並 return
  • 藉此完成了矩陣轉置

Prefetch

  • _mm_prefetch(char * p , int i ) 會將 address p 裡的數值加載到更靠近 processor 的位置
  • Hint 有四種值
  • 使用 prefetch 後時間有明顯的縮短
sse prefetch: 	 68962 us
  • 將 PFDIST 從 8 改成 4 看看
sse prefetch: 	 75842 us

時間差了約 7000 us

  • 將 Hint 改為 T0、T1、T2 三者沒有很大的差別,但改為 NTA 後時間大幅增加
sse prefetch: 	 128963 us
  • 自己對 _mm_prefetch 的理解,在運算時可以同時 prefetch 之後會用到的資料,減少下一次運算需要載入資料的時間,所以當 PFDIST = 8、Hint = NTA 時,先載入了下兩次要用的資料到 L1 cache 並標示為 Non-temporal,但是下一次運算不會立即用到,所以有可能在下一次運算時又把他取代了
  • 將 PFDIST = 4、Hint = NTA,因為立刻被用到所以時間減少了
./sse_prefetch
sse_prefetch: 	 70545 us

Object-Oriented Programming

struct object {
    int *src;
    int *dst;
    int w;
    int h;
    func_t transpose;
};
  • 用三個 init 來讓 transpose 指向不同的 transpose function
void init_object_naive(Object **self)
{
    ...
    (*self)->transpose = naive_transpose;
}

void init_object_sse(Object **self)
{
    ...
    (*self)->transpose = sse_transpose;
}

void init_object_sse_prefetch(Object **self)
{
    ...
    (*self)->transpose = sse_prefetch_transpose;
}
  • 在 Makefile 中加上 -D 來區別不同的執行檔需要 define 的參數
naive: main.c
    $(CC) $(CFLAGS) -D NAIVE -o naive main.c

sse: main.c
    $(CC) $(CFLAGS) -D SSE -o sse main.c

sse_prefetch: main.c
    $(CC) $(CFLAGS) -D SSE_PREFETCH -o sse_prefetch main.c
  • 在 main.c 中用 ifdef 來實現不同的 init
    Object *o = NULL;
#ifdef NAIVE
    if(init_object_naive(&o) == -1) {
        printf("Init object naive fail\n");                            
        return -1;
    }
#elif SSE
    if(init_object_sse(&o) == -1) {
        printf("Init object sse fail\n");
        return -1;
    }
#elif SSE_PREFETCH
    if(init_object_sse_prefetch(&o) == -1) {
        printf("Init object sse prefetch fail\n");
        return -1;
    }
#endif

Implement a interface

  • 應該要做一個 interface,所以修改程式碼為
typedef struct {
    void (*transpose)(int *src, int *dst, int w, int h); 
} trans_interface;
int naive_init(trans_interface *self)
{
    self->transpose = &naive_transpose;
    return 0;
}

int sse_init(trans_interface *self)
{
    self->transpose = &sse_transpose;
    return 0;
}

int sse_prefetch_init(trans_interface *self)
{
    self->transpose = &sse_prefetch_transpose;
    return 0;
}

Raw counter

Event Num. Umask Value Event Mask Mnemonic Description
4CH 01H LOAD_HIT_PRE.SW_PF Non-SW-prefetch load dispatches that hit fill buffer allocated for S/W prefetch.
4CH 02H LOAD_HIT_PRE.HW_PF Non-SW-prefetch load dispatches that hit fill buffer allocated for H/W prefetch.
D0H 81H MEM_UOPS_RETIRED.ALL_LOADS All retired load uops
D1H 01H MEM_LOAD_UOPS_RETIRED.L1_HIT Retired load uops with L1 cache hits as data sources.
D1H 02H MEM_LOAD_UOPS_RETIRED.L2_HIT Retired load uops with L2 cache hits
D1H 04H MEM_LOAD_UOPS_RETIRED.L3_HIT Retired load uops with L3 cache hits as data sources.
D1H 08H MEM_LOAD_UOPS_RETIRED.L1_MISS Retired load uops missed L1 cache as data sources.
D1H 10H MEM_LOAD_UOPS_RETIRED.L2_MISS Retired load uops missed L2. Unknown data source excluded.
D1H 20H MEM_LOAD_UOPS_RETIRED.L3_MISS Retired load uops missed L3. Excludes unknown data
  • 用 perf stat 觀察
Performance counter stats for './naive' (100 runs):

        15,813,284      cache-misses              #   97.987 % of all cache refs      ( +-  0.33% )  (13.75%)
        16,138,211      cache-references                                              ( +-  0.37% )  (14.55%)
        19,456,866      L1-dcache-load-misses     #    3.57% of all L1-dcache hits    ( +-  0.35% )  (21.86%)
       545,545,436      L1-dcache-loads                                               ( +-  0.17% )  (21.14%)
       198,561,491      L1-dcache-stores                                              ( +-  0.43% )  (19.73%)
            47,086      L1-icache-load-misses                                         ( +-  2.16% )  (13.62%)
               446      r014c    //LOAD_HIT_PRE.SW_PF                                                     ( +-  6.35% )  (13.54%)
           324,008      r024c    //LOAD_HIT_PRE.HW_PF                                                     ( +-  7.39% )  (21.03%)
       509,653,725      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                     ( +-  0.45% )  (19.74%)
       474,628,725      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                      ( +-  0.38% )  (18.81%)
            17,779      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                     ( +-  9.45% )  (13.77%)
           443,351      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                      ( +-  1.18% )  (13.68%)
        15,306,768      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                     ( +-  0.42% )  (13.58%)
        15,355,248      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                     ( +-  0.38% )  (13.49%)
        14,982,092      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                    ( +-  0.53% )  (13.40%)

       0.440921106 seconds time elapsed                                          ( +-  0.31% )

Performance counter stats for './sse' (100 runs):

         4,242,455      cache-misses              #   88.814 % of all cache refs      ( +-  0.43% )  (13.87%)
         4,776,803      cache-references                                              ( +-  0.43% )  (14.98%)
         7,758,666      L1-dcache-load-misses     #    1.78% of all L1-dcache hits    ( +-  0.34% )  (22.39%)
       435,901,161      L1-dcache-loads                                               ( +-  0.30% )  (21.24%)
       222,396,741      L1-dcache-stores                                              ( +-  0.38% )  (19.23%)
            43,121      L1-icache-load-misses                                         ( +-  2.63% )  (13.69%)
               373      r014c    //LOAD_HIT_PRE.SW_PF                                                     ( +-  6.48% )  (13.78%)
           295,849      r024c    //LOAD_HIT_PRE.HW_PF                                                      ( +-  7.43% )  (21.43%)
       398,878,132      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                     ( +-  0.61% )  (19.64%)
       377,177,446      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                    ( +-  0.42% )  (18.47%)
             8,856      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                    ( +- 12.83% )  (14.03%)
           125,420      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                    ( +-  1.94% )  (13.92%)
         3,771,031      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                    ( +-  0.84% )  (13.74%)
         3,793,111      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                   ( +-  0.81% )  (13.59%)
         3,678,870      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                   ( +-  0.83% )  (13.41%)

       0.317429370 seconds time elapsed                                          ( +-  0.31% )
  
Performance counter stats for './sse_prefetch' (100 runs):

         3,978,276      cache-misses              #   90.202 % of all cache refs      ( +-  0.88% )  (13.96%)
         4,410,407      cache-references                                              ( +-  0.56% )  (15.38%)
         7,640,737      L1-dcache-load-misses     #    1.68% of all L1-dcache hits    ( +-  0.73% )  (23.12%)
       455,983,625      L1-dcache-loads                                               ( +-  0.25% )  (21.90%)
       228,291,008      L1-dcache-stores                                              ( +-  0.27% )  (19.32%)
            41,051      L1-icache-load-misses                                         ( +-  2.26% )  (13.53%)
           738,067      r014c    //LOAD_HIT_PRE.SW_PF                                                      ( +-  3.08% )  (13.94%)
           263,426      r024c    //LOAD_HIT_PRE.HW_PF                                                     ( +-  8.58% )  (22.27%)
       409,935,362      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                     ( +-  0.68% )  (19.97%)
       405,300,498      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                     ( +-  0.54% )  (18.03%)
         3,266,225      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                     ( +-  1.14% )  (14.18%)
            20,790      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                     ( +-  5.32% )  (14.08%)
         3,425,788      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                     ( +-  1.46% )  (13.88%)
            47,714      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                     ( +-  7.27% )  (13.61%)
            24,762      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                     ( +-  6.40% )  (13.38%)

       0.260656173 seconds time elapsed                                          ( +-  0.90% )
  • LOAD_HIT_PRE.SW_PF 的部份 sse_prefetch 特別高,也代表真的有用到 SW prefetch
  • LOAD_HIT_PRE.HW_PF 的部份都差不多,因為三個用到的 HW prefetch 都是全開的狀態
  • 發現 sse_prefetch 在 L2 cache HIT 的情況增加許多,而 且 L2 cache MISS 減少許多,因為使用 _MM_HINT_T1 的關係,將資料 prefetch 到 L2 跟 L3 了

AVX

void print256_num(__m256i var)
{
    uint16_t *val = (uint16_t*) &var;
    printf(" %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d %2d\n",
          val[0], val[1], val[2], val[3], val[4], val[5], 
          val[6], val[7], val[8], val[9], val[10], val[11],
          val[12], val[13], val[14], val[15]);
}
  • 最後用 _mm256_unpacklo_epi32、_mm256_unpacklo_epi32、_mm256_permute2x128_si256 完成矩陣的轉置
  • 在編譯時多加 -mavx2
CFLAGS = -msse2 --std gnu99 -O0 -Wall -Wextra -mavx2
  • 實際跑跑看
illusion030@illusion030-X550LD:~/Desktop/2017sysprog/prefetcher$ make run
./naive
naive: 	 289409 us
./sse
sse: 	 127841 us
./sse_prefetch
sse_prefetch: 	 70374 us
./avx
avx: 	 66426 us
  • 用 AVX 的速度明顯快很多,因為是一次算 8 * 8 的矩陣,跟 SSE 一次算 4 * 4 比較,需要算的次數變少許多
  • 用 perf stat 分析看看
Performance counter stats for './avx' (100 runs):

         3,092,766      cache-misses              #   86.521 % of all cache refs      ( +-  0.54% )  (14.04%)
         3,574,569      cache-references                                              ( +-  0.63% )  (15.41%)
         7,226,514      L1-dcache-load-misses     #    1.86% of all L1-dcache hits    ( +-  0.75% )  (23.17%)
       388,074,299      L1-dcache-loads                                               ( +-  0.23% )  (21.88%)
       201,357,224      L1-dcache-stores                                              ( +-  0.33% )  (19.41%)
            45,747      L1-icache-load-misses                                         ( +-  2.91% )  (13.78%)
               581      r014c    //LOAD_HIT_PRE.SW_PF                                                      ( +- 10.65% )  (13.95%)
           326,686      r024c    //LOAD_HIT_PRE.HW_PF                                                     ( +-  7.99% )  (21.75%)
       357,270,167      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                     ( +-  0.79% )  (19.44%)
       343,995,831      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                     ( +-  0.58% )  (17.84%)
            13,222      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                     ( +- 12.09% )  (14.20%)
            80,787      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                     ( +-  2.55% )  (14.01%)
           951,656      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                     ( +-  1.21% )  (13.86%)
           944,359      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                   ( +-  1.28% )  (13.62%)
           879,774      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                    ( +-  1.30% )  (13.46%)

       0.259288102 seconds time elapsed                                          ( +-  0.38% )
  • 每一層 cache 的 load misses 都比 sse 少很多

AVX Prefetch

  • 因為 AVX 和 AVX2 的函式庫裡沒有 Prefetch 的 Intrinsic,所以先用 SSE2 的 _mm_prefetch
  • prefetch 8 次
_mm_prefetch(src + (y + AVX_PFDIST + 0) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 1) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 2) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 3) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 4) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 5) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 6) * w + x, _MM_HINT_T1);
_mm_prefetch(src + (y + AVX_PFDIST + 7) * w + x, _MM_HINT_T1);
  • 用 PFDIST = 16 (下兩次)、Hint = T1 跑跑看
./naive
naive: 	 244264 us
./sse
sse: 	 130113 us
./sse_prefetch
sse_prefetch: 	 74131 us
./avx
avx: 	 65895 us
./avx_prefetch
avx_prefetch: 	 67076 us
  • 耗的時間反而提高了,用 perf stat 觀察看看
Performance counter stats for './avx_prefetch' (100 runs):

         3,071,554      cache-misses              #   61.280 % of all cache refs      ( +-  0.52% )  (14.09%)
         5,012,325      cache-references                                              ( +-  0.47% )  (15.50%)
         6,729,740      L1-dcache-load-misses     #    1.68% of all L1-dcache hits    ( +-  0.55% )  (23.25%)
       400,788,741      L1-dcache-loads                                               ( +-  0.18% )  (22.03%)
       200,235,391      L1-dcache-stores                                              ( +-  0.25% )  (19.95%)
            64,761      L1-icache-load-misses                                         ( +-  2.66% )  (14.43%)
           191,524      r014c    //LOAD_HIT_PRE.SW_PF                                                    ( +-  3.10% )  (13.55%)
           550,186      r024c    //LOAD_HIT_PRE.HW_PF                                                      ( +-  4.88% )  (20.59%)
       385,091,102      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                      ( +-  0.73% )  (18.18%)
       358,884,415      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                     ( +-  0.74% )  (17.42%)
            32,042      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                     ( +-  1.97% )  (14.16%)
           553,243      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                     ( +-  1.15% )  (14.16%)
         1,182,803      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                    ( +-  1.19% )  (13.92%)
         1,194,809      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                   ( +-  0.93% )  (13.64%)
           627,029      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                    ( +-  0.59% )  (13.38%)

       0.261470543 seconds time elapsed                                          ( +-  0.21% )
  • LOAD_HIT_PRE.SW_PF 有增加代表 _mm_prefetch 有用
  • L2 cache HIT 只增加一點點,L3 cache HIT 倒是增加了許多,但是 Hint 是 T1 預期是 L2 cache HIT 會增加
  • prefetch 有作用但是時間卻沒有明顯減少反而增加了,假設是 PFDIST 取的不好,來認真研究一下 PFDIST 要怎麼取

PFDIST

illusion030@illusion030-X550LD:~/Desktop/memory_latency$ sudo modprobe msr
illusion030@illusion030-X550LD:~/Desktop/memory_latency$ ./mlc
Intel(R) Memory Latency Checker - v3.1a
tried to open /dev/cpu/0/msr
Cannot access MSR module. Possibilities include not having root privileges, or MSR module not inserted (try 'modprobe msr', and refer README)
  • 索性直接進入 root 模式來執行看看,結果成功了
root@illusion030-X550LD:/home/illusion030/Desktop/memory_latency# ./mlc
Intel(R) Memory Latency Checker - v3.1a
Measuring idle latencies (in ns)...
	Memory node
Socket	     0	
     0	  62.3	

Measuring Peak Memory Bandwidths for the system
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using traffic with the following read-write ratios
ALL Reads        :	19368.4	
3:1 Reads-Writes :	16337.4	
2:1 Reads-Writes :	16425.5	
1:1 Reads-Writes :	18360.1	
Stream-triad like:	17216.6	

Measuring Memory Bandwidths between nodes within system 
Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
	Memory node
 Socket	     0	
     0	19441.2	

Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	 74.88	  17074.9
 00002	 74.87	  17084.0
 00008	 74.79	  17060.9
 00015	 74.34	  16855.8
 00050	 66.22	   9211.8
 00100	 65.24	   6029.9
 00200	 63.82	   3733.0
 00300	 64.71	   2855.0
 00400	 63.50	   2424.7
 00500	 63.97	   2143.7
 00700	 63.72	   1828.5
 01000	 63.42	   1589.5
 01300	 63.21	   1460.4
 01700	 63.10	   1357.4
 02500	 63.07	   1248.7
 03500	 63.06	   1182.3
 05000	 62.99	   1133.5
 09000	 62.96	   1081.9
 20000	 62.99	   1045.7

Measuring cache-to-cache transfer latency (in ns)...
Local Socket L2->L2 HIT  latency	30.4
Local Socket L2->L2 HITM latency	35.7
  • 邊執行程式邊跑 mlc 得到的 latency
Measuring Loaded Latencies for the system
Using all the threads from each core if Hyper-threading is enabled
Using Read-only traffic type
Inject	Latency	Bandwidth
Delay	(ns)	MB/sec
==========================
 00000	 86.82	  15573.2
 00002	 85.72	  15515.8
 00008	 85.22	  15528.7
 00015	 86.11	  15460.6
 00050	 75.05	   8977.5
 00100	 73.72	   5871.2
 00200	 71.39	   3587.3
 00300	 72.28	   2730.4
 00400	 71.15	   2291.8
 00500	 70.06	   2041.5
 00700	 70.79	   1712.6
 01000	 70.38	   1474.3
 01300	 70.33	   1347.3
 01700	 69.51	   1259.1
 02500	 69.82	   1146.5
 03500	 69.62	   1083.9
 05000	 69.46	   1037.1
 09000	 69.98	    978.6
 20000	 69.15	    954.5
  • 算出平均為 74.029473684 (ns)
  • 接下來找出迴圈跑的時間
loop time = 65344 us
avx: 	 65358 us
  • each loop time = 65344 us / (512 * 512) = 0.249267578
  • 1 us = 1000 ns
  • 74.029473684 / 249.267578125 = 0.296987977
  • 取上高斯之後 D >= 1,所以把 avx_prefetch 的 PFDIST 設成 8,也就是一倍,用 T1 跑跑看
naive: 	 247381 us
sse: 	 128336 us
sse_prefetch: 	 71009 us
avx: 	 66292 us
avx_prefetch: 	 64742 us
  • 時間還是沒有加快多少,再用 perf stat 觀察看看
Performance counter stats for './avx_prefetch' (100 runs):

         3,162,625      cache-misses              #   79.583 % of all cache refs      ( +-  0.44% )  (14.07%)
         3,973,992      cache-references                                              ( +-  0.47% )  (15.46%)
         7,050,577      L1-dcache-load-misses     #    1.75% of all L1-dcache hits    ( +-  0.53% )  (23.21%)
       403,421,197      L1-dcache-loads                                               ( +-  0.14% )  (21.87%)
       202,331,106      L1-dcache-stores                                              ( +-  0.17% )  (19.32%)
            57,234      L1-icache-load-misses                                         ( +-  2.94% )  (13.47%)
           344,731      r014c    //LOAD_HIT_PRE.SW_PF                                                      ( +-  2.17% )  (13.63%)
           262,287      r024c    //LOAD_HIT_PRE.HW_PF                                                     ( +-  8.04% )  (22.16%)
       353,934,197      r81d0    //MEM_UOPS_RETIRED.ALL_LOADS                                                     ( +-  0.68% )  (19.86%)
       349,566,969      r01d1    //MEM_LOAD_UOPS_RETIRED.L1_HIT                                                     ( +-  0.42% )  (18.09%)
           780,509      r02d1    //MEM_LOAD_UOPS_RETIRED.L2_HIT                                                   ( +-  0.47% )  (14.26%)
            50,093      r04d1    //MEM_LOAD_UOPS_RETIRED.L3_HIT                                                    ( +-  2.24% )  (14.12%)
         1,227,596      r08d1    //MEM_LOAD_UOPS_RETIRED.L1_MISS                                                    ( +-  0.72% )  (13.87%)
           452,265      r10d1    //MEM_LOAD_UOPS_RETIRED.L2_MISS                                                    ( +-  0.87% )  (13.66%)
           400,909      r20d1    //MEM_LOAD_UOPS_RETIRED.L3_MISS                                                  ( +-  0.91% )  (13.44%)

       0.256960590 seconds time elapsed                                          ( +-  0.18% )
  • cache hit 的部份有比較跟預期的一樣, L2 LOAD HIT 有增加

  • 因為 l / s 實在太小了只有大約 0.3,沒辦法取到一個很好的 PFDIST,所以 AVX+prefetch 在這個程式上沒有很好的表現