Try   HackMD

2017q1 Homework3 (software-pipelining)

contributed by <baimao8437>

論文筆記

前言:

論文研究方向有三:

  1. source code-level analysis 直接說明SW、HW的優缺點
  2. SW、HW之間的協同作用(synergistic)和反作用(antagonistic effects)
  3. 在SW有request的情況下評估HW的訓練策略

名詞解釋:

  • Prefetch
    在 cpu 要運算之前把資料從主記憶體預先載入 cache,降低 cpu 等待時間進而提升效率.
    載入至主記憶體的時機不能太晚(cpu 已經運算完畢不再需要這些資料)也不能太早(有可能在使用之前就被踢出快取記憶體),所以計算 prefetch distance 很重要。

  • Hardware prefetch
    是用過去的存取紀錄作為 prefetch 的依據,所以需要 training。
    e.g. GHB:Global History Buffer

  • Software prefetch
    人為或 compiler 在適當的地方加入 prefetching intrinsics。
    但 ICC or GCC 中只有簡單的 SW 指令,所以比較仰賴人為加入。
    e.g. SSE 的 __mm_prefetch()

而對於手動加入指令有兩個最大的問題:

  1. 對於如何插入 prefetch intrinsic 沒有一個嚴格的標準
  2. 軟體與硬體的 prefetch 交互使用的複雜度沒有一個很好的了解

SW & HW 的交互作用實驗結果:

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

  • GHB(stride prefetcher)
    stride 是指說你是一次跳幾格去 prefetch cache line。
  • STR(stream prefetcher)
    stream 是指說你的 prefetch 是連續的抓取 cache line。

可以看到左邊 Positive 區的實驗結果顯示在這些 benchmark 的情況下,SW+HW 會比 HW 來得更有效率。

實驗結果重點

  • SW 對於 short array streams, irregular memory address patterns, and L1 cache miss reduction 效果較好
  • SW 會對 HW 的 training 產生干擾,讓效能變差,尤其是在 stream 的時候。

從實驗中,產生下列主要的三個問題:

  1. What are the limitations and overheads of software prefetching?
  2. What are the limitations and overheads of hardware prefetching?
  3. When is it beneficial to use software and/or hardware prefetching?

SW & HW 的背景知識

Data Structure

此表分析了不同 data structure 的性質及當時已經有提出的 HW SW prefetch 方式

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

而當中的 D 就是 prefetch distance

Prefetch hint

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

主要使用在SSE的__mm_prefetch([address],[hint]);

Prefetch 分類 & Timeliness

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

Prefetch distance

Prefetch 希望能夠 timely,所以要設定prefetch distance,要大到足夠隱藏 latency。

Dls

  • D is called the prefetch distance
  • l is the prefetch latency
  • s is the length of the shortest path through the loop body

Direct\Indirect index

  • Direct: 是直接可以從指令中知道要存取的記憶體位址,可以透過 hareward prefetch.

  • Indirect: 就像這行程式碼 x = a[b[i]],需要先計算b[i]的值才知道要存取a[]中的哪個位置,使用 software prefetch 比較容易達成.

Image Not Showing Possible Reasons
  • The image file may be corrupted
  • The server hosting the image is unavailable
  • The image path is incorrect
  • The image format is not supported
Learn More →

(a)Direct: 2個指令
(b)Indirect: 4個指令

程式實作 矩陣轉置

開發環境

baimao@baimao-Aspire-V5-573G:~$ lscpu
Architecture:          x86_64
CPU 作業模式:    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數:2
每通訊端核心數:2
Socket(s):             1
NUMA 節點:         1
供應商識別號:  GenuineIntel
CPU 家族:          6
型號:              69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程:              1
CPU MHz:             1711.242
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.38
虛擬:              VT-x
L1d 快取:          32K
L1i 快取:          32K
L2 快取:           256K
L3 快取:           3072K
NUMA node0 CPU(s):     0-3

CPU:Intel Colre i5-4200U 資訊

Architecture / Microarchitecture
Microarchitecture        Haswell
Platform		  Shark Bay
Processor core   	  Haswell
Core stepping   	  C0 (SR170)
Manufacturing process	  0.022 micron
Data width		  64 bit
The number of CPU cores  2
The number of threads	  4
Floating Point Unit	  Integrated
Level 1 cache size  	  2 x 32 KB 8-way set associative instruction caches 
			  2 x 32 KB 8-way set associative data caches
Level 2 cache size  	  2 x 256 KB 8-way set associative caches
Level 3 cache size	  3 MB 12-way set associative shared cache
Physical memory		  16 GB
Multiprocessing		  Uniprocessor
Features		  MMX instructions
			  SSE / Streaming SIMD Extensions
			  SSE2 / Streaming SIMD Extensions 2
			  SSE3 / Streaming SIMD Extensions 3
			  SSSE3 / Supplemental Streaming SIMD Extensions 3
			  SSE4 / SSE4.1 + SSE4.2 / Streaming SIMD Extensions 4  
			  AES / Advanced Encryption Standard instructions
			  AVX / Advanced Vector Extensions
			  AVX2 / Advanced Vector Extensions 2.0
			  BMI / BMI1 + BMI2 / Bit Manipulation instructions
			  F16C / 16-bit Floating-Point conversion instructions
			  FMA3 / 3-operand Fused Multiply-Add instructions
			  EM64T / Extended Memory 64 technology / Intel 64  
			  NX / XD / Execute disable bit  
			  HT / Hyper-Threading technology   
			  TBT 2.0 / Turbo Boost technology 2.0   
			  VT-x / Virtualization technology 
Low power features	  Enhanced SpeedStep technology 

開發紀錄

第一次執行結果

  0  1  2  3
  4  5  6  7
  8  9 10 11
 12 13 14 15

  0  4  8 12
  1  5  9 13
  2  6 10 14
  3  7 11 15
sse prefetch: 	 70873 us
sse: 		 144184 us
naive: 		 244073 us
  • gnuplot make plot
    Image Not Showing Possible Reasons
    • The image file may be corrupted
    • The server hosting the image is unavailable
    • The image path is incorrect
    • The image format is not supported
    Learn More →

這是100次取平均值 naive 的時間不穩定 所以平均的誤差大
但可以看到 sse + prefetch 的時間最短

接著改寫makefilemain將三種方法分開進行 perf 測試$ make cache-test
perf 所設定的參數[1]:

cache-misses
cache-references
L1-dcache-load-misses
L1-dcache-loads
L1-dcache-stores 
L1-icache-load-misses 

makefile:

EXEC = \
	sse_prefetch_transpose\
	sse_transpose\
	naive_transpose

cache-test: $(EXEC)
	for method in $(EXEC);do \
		perf stat --repeat 100 \
		-e cache-misses,cache-references,L1-dcache-load-misses,L1-dcache-loads,L1-dcache-stores,L1-icache-load-misses \
		./$$method;\
	done

這裡有個困惑 當我嘗試將 -e 後面的參數斷行
-e cache-misses,cache-references, \
L1-dcache-load-misses,L1-dcache-loads, \
L1-dcache-stores,L1-icache-load-misses \
執行時會發生錯誤:
event syntax error: 'cache-misses,cache-references,'
___ parser error
Run 'perf list' for a list of valid events
Usage: perf stat [<options>] [<command>]
-e, event <event> event selector. use 'perf list' to list available events

我改成
-e cache-misses,cache-references \
-e L1-dcache-load-misses,L1-dcache-loads \
-e L1-dcache-stores,L1-icache-load-misses
就沒有出現錯誤訊息了 claaaaassic

歐歐~ 那我.. 還是一行好了orzbaimao8437

pref 結果 $ make cache-test

  • sse prefetch
 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,759,440      cache-misses              #   85.111 % of all cache refs      ( +-  0.69% )  (32.89%)
         5,592,027      cache-references                                              ( +-  0.49% )  (50.39%)
         8,481,323      L1-dcache-load-misses     #    1.83% of all L1-dcache hits    ( +-  0.26% )  (67.13%)
       464,186,704      L1-dcache-loads                                               ( +-  0.07% )  (65.37%)
       234,986,970      L1-dcache-stores                                              ( +-  0.10% )  (61.59%)
            55,906      L1-icache-load-misses                                         ( +-  3.66% )  (32.76%)

       0.268066902 seconds time elapsed                                          ( +-  0.64% )
  • sse
 Performance counter stats for './sse_transpose' (100 runs):

         4,807,128      cache-misses              #   86.253 % of all cache refs      ( +-  0.64% )  (33.36%)
         5,573,299      cache-references                                              ( +-  0.31% )  (50.49%)
         8,545,880      L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  0.15% )  (67.18%)
       441,706,999      L1-dcache-loads                                               ( +-  0.10% )  (65.64%)
       231,528,304      L1-dcache-stores                                              ( +-  0.12% )  (62.41%)
            64,512      L1-icache-load-misses                                         ( +-  3.90% )  (32.84%)

       0.341006236 seconds time elapsed                                          ( +-  1.00% )
  • naive
 Performance counter stats for './naive_transpose' (100 runs):

        17,137,199      cache-misses              #   94.603 % of all cache refs      ( +-  0.28% )  (33.27%)
        18,114,904      cache-references                                              ( +-  0.19% )  (50.21%)
        21,069,839      L1-dcache-load-misses     #    3.82% of all L1-dcache hits    ( +-  0.11% )  (66.89%)
       551,000,987      L1-dcache-loads                                               ( +-  0.07% )  (65.83%)
       213,504,897      L1-dcache-stores                                              ( +-  0.16% )  (63.62%)
            71,981      L1-icache-load-misses                                         ( +-  4.93% )  (33.07%)

       0.467403806 seconds time elapsed                                          ( +-  1.33% )

可以看到 cache-misses 確實是 sse prefetch 最低
因為 prefetch 先將所需要的資料載入快取記憶體降低了 cache-misses,低的 miss rate 讓他的執行速度比較快。

Row Counter

參考kaizsv 的共筆intel 文件中找 4th Generation Haswell microarchitecture
察看下列項目:

  • r014c: LOAD_HIT_PRE.SW_PF
  • r024c: LOAD_HIT_PRE.HW_PF

結果:

 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,911,498      cache-misses              #   84.564 % of all cache refs      ( +-  0.78% )  (49.91%)
         5,808,046      cache-references                                              ( +-  0.49% )  (50.22%)
         8,670,151      L1-dcache-load-misses     #    1.88% of all L1-dcache hits    ( +-  0.39% )  (50.51%)
       460,232,822      L1-dcache-loads                                               ( +-  0.14% )  (45.85%)
       237,557,467      L1-dcache-stores                                              ( +-  0.22% )  (25.48%)
            54,105      L1-icache-load-misses                                         ( +-  3.69% )  (25.65%)
           499,850      r014c                                                         ( +-  1.84% )  (25.31%)
           272,620      r024c                                                         ( +-  5.06% )  (37.61%)

       0.277024409 seconds time elapsed                                          ( +-  1.18% )


 Performance counter stats for './sse_transpose' (100 runs):

         4,743,884      cache-misses              #   85.357 % of all cache refs      ( +-  0.39% )  (49.71%)
         5,557,698      cache-references                                              ( +-  0.28% )  (49.95%)
         8,494,094      L1-dcache-load-misses     #    1.99% of all L1-dcache hits    ( +-  0.23% )  (50.33%)
       426,955,982      L1-dcache-loads                                               ( +-  0.15% )  (46.63%)
       236,397,375      L1-dcache-stores                                              ( +-  0.26% )  (25.85%)
            60,052      L1-icache-load-misses                                         ( +-  3.60% )  (25.54%)
               464      r014c                                                         ( +-  4.00% )  (25.23%)
           312,872      r024c                                                         ( +-  4.40% )  (37.39%)

       0.326735088 seconds time elapsed                                          ( +-  0.44% )


 Performance counter stats for './naive_transpose' (100 runs):

        16,888,393      cache-misses              #   92.475 % of all cache refs      ( +-  0.25% )  (49.93%)
        18,262,729      cache-references                                              ( +-  0.20% )  (50.39%)
        21,341,610      L1-dcache-load-misses     #    4.01% of all L1-dcache hits    ( +-  0.17% )  (50.91%)
       532,720,948      L1-dcache-loads                                               ( +-  0.10% )  (48.00%)
       228,505,453      L1-dcache-stores                                              ( +-  0.23% )  (25.06%)
            64,509      L1-icache-load-misses                                         ( +-  3.95% )  (25.08%)
               496      r014c                                                         ( +-  5.12% )  (24.98%)
           249,019      r024c                                                         ( +-  4.90% )  (37.44%)

       0.437171123 seconds time elapsed                                          ( +-  0.56% )

sse prefetch 的 r014c 明顯比其他大 代表真的有執行 SW prefetch

實驗不同 Prefetch Distance

論文中提到 Prefetch Distance 會影響 Prefetch 的效率
impl.c 中的 PFDIST 就是 distance
將其用 0,2,4,7,8,9,10,12,16 代入 因為預設8 固檢查8附近的值
$ make plot DIS=1

結果符合理論 distance 過猶不及 這邊 8 代入的效率最好

實驗不同 Prefetch Hints

如同前方表格顯示共有4種hints 將其分別代入比較(預設是 T1)

  • _MM_HINT_T0
 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,667,916      cache-misses              #   82.749 % of all cache refs      ( +-  0.57% )  (49.02%)
         5,641,054      cache-references                                              ( +-  0.44% )  (49.48%)
         8,916,708      L1-dcache-load-misses     #    1.93% of all L1-dcache hits    ( +-  0.40% )  (50.67%)
       463,153,960      L1-dcache-loads                                               ( +-  0.18% )  (46.33%)
       235,037,792      L1-dcache-stores                                              ( +-  0.25% )  (25.85%)
            50,785      L1-icache-load-misses                                         ( +-  4.21% )  (25.70%)
         1,237,631      r014c                                                         ( +-  1.73% )  (24.94%)
           403,505      r024c                                                         ( +-  3.30% )  (36.74%)

       0.256539780 seconds time elapsed                                          ( +-  0.57% )
  • _MM_HINT_T1
 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,906,154      cache-misses              #   82.649 % of all cache refs      ( +-  0.20% )  (49.38%)
         5,936,140      cache-references                                              ( +-  0.14% )  (49.40%)
         8,625,730      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    ( +-  0.27% )  (49.46%)
       467,546,983      L1-dcache-loads                                               ( +-  0.11% )  (44.87%)
       232,786,099      L1-dcache-stores                                              ( +-  0.17% )  (26.09%)
            37,113      L1-icache-load-misses                                         ( +-  1.55% )  (26.02%)
           441,530      r014c                                                         ( +-  1.60% )  (25.60%)
           428,107      r024c                                                         ( +-  2.72% )  (37.41%)

       0.253216818 seconds time elapsed                                          ( +-  0.10% )
  • _MM_HINT_T2
 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,931,244      cache-misses              #   83.404 % of all cache refs      ( +-  0.71% )  (49.53%)
         5,912,467      cache-references                                              ( +-  0.53% )  (49.67%)
         8,578,641      L1-dcache-load-misses     #    1.84% of all L1-dcache hits    ( +-  0.39% )  (49.89%)
       466,874,784      L1-dcache-loads                                               ( +-  0.17% )  (45.45%)
       234,373,157      L1-dcache-stores                                              ( +-  0.22% )  (26.58%)
            45,452      L1-icache-load-misses                                         ( +-  6.35% )  (25.87%)
           435,123      r014c                                                         ( +-  1.95% )  (25.43%)
           408,551      r024c                                                         ( +-  3.97% )  (37.36%)

       0.264967830 seconds time elapsed                                          ( +-  1.18% )
  • _MM_HINT_NTA
 Performance counter stats for './sse_prefetch_transpose' (100 runs):

         4,848,945      cache-misses              #   82.589 % of all cache refs      ( +-  0.41% )  (49.29%)
         5,871,166      cache-references                                              ( +-  0.36% )  (49.36%)
         8,568,859      L1-dcache-load-misses     #    1.83% of all L1-dcache hits    ( +-  0.27% )  (49.50%)
       467,023,161      L1-dcache-loads                                               ( +-  0.12% )  (45.17%)
       232,552,596      L1-dcache-stores                                              ( +-  0.16% )  (26.36%)
            35,771      L1-icache-load-misses                                         ( +-  2.76% )  (25.96%)
           451,018      r014c                                                         ( +-  1.83% )  (25.57%)
           428,982      r024c                                                         ( +-  2.64% )  (37.33%)

       0.253415363 seconds time elapsed                                          ( +-  0.18% )
  • gnuplot

    可以觀察出 T0 時 r014c 最高 因為他將資料 prefetch 到所有 level 所以效率最高

AVX

首先先從了解 SSE 的運作方式[2] 然後將原本的 4*4 matrix 改為 8*8 matrix
code的部份也照著 SSE 的步驟改為 AVX 但結果並不正確:

I0 ->  0  8 16 24 | 4 12 20 28
I1 ->  1  9 17 25 | 5 13 21 29
I2 ->  2 10 18 26 | 6 14 22 30
I3 ->  3 11 19 27 | 7 15 23 31
-----------------------------
I4 -> 32 40 48 56 | 36 44 52 60
I5 -> 33 41 49 57 | 37 45 53 61
I6 -> 34 42 50 58 | 38 46 54 62
I7 -> 35 43 51 59 | 39 47 55 63

左下和右上的角應該要交換才對
最後參考周曠宇的共筆 加入這幾行code完成交換

T0 = _mm256_permute2x128_si256(I0, I4, 0x20);
T1 = _mm256_permute2x128_si256(I1, I5, 0x20);
T2 = _mm256_permute2x128_si256(I2, I6, 0x20);
T3 = _mm256_permute2x128_si256(I3, I7, 0x20);
T4 = _mm256_permute2x128_si256(I0, I4, 0x31);
T5 = _mm256_permute2x128_si256(I1, I5, 0x31);
T6 = _mm256_permute2x128_si256(I2, I6, 0x31);
T7 = _mm256_permute2x128_si256(I3, I7, 0x31);

但這個指令我還在研究原理為何

試著加入 prefetch 但是 AVX 並沒有支援 prefetch 要到 AVX-512才有支援
所以先用 SSE 的 prefetch 這裡先預設 distance = 8, Hints = T1

  • 結果
    • avx perf
    ​​​​    -  Performance counter stats for './avx_transpose' (100 runs):
    
    ​​​​         3,612,052      cache-misses              #   77.914 % of all cache refs      ( +-  0.36% )  (50.22%)
    ​​​​         4,635,946      cache-references                                              ( +-  0.29% )  (50.45%)
    ​​​​         8,223,505      L1-dcache-load-misses     #    2.11% of all L1-dcache hits    ( +-  0.19% )  (50.50%)
    ​​​​       390,434,639      L1-dcache-loads                                               ( +-  0.07% )  (45.13%)
    ​​​​       216,084,485      L1-dcache-stores                                              ( +-  0.14% )  (25.38%)
    ​​​​            36,183      L1-icache-load-misses                                         ( +-  1.32% )  (25.96%)
    ​​​​               359      r014c                                                         ( +-  5.20% )  (25.59%)
    ​​​​           208,075      r024c                                                         ( +-  4.01% )  (37.76%)
    
    ​​​​       0.259036199 seconds time elapsed                                          ( +-  0.15% )
    
    • avx_prefetch perf
    ​​​​ Performance counter stats for './avx_prefetch_transpose' (100 runs):
    
    ​​​​         3,870,080      cache-misses              #   75.536 % of all cache refs      ( +-  1.15% )  (49.67%)
    ​​​​         5,123,514      cache-references                                              ( +-  0.64% )  (49.90%)
    ​​​​         8,183,794      L1-dcache-load-misses     #    2.01% of all L1-dcache hits    ( +-  0.43% )  (50.22%)
    ​​​​       407,171,607      L1-dcache-loads                                               ( +-  0.22% )  (45.78%)
    ​​​​       214,583,897      L1-dcache-stores                                              ( +-  0.31% )  (25.80%)
    ​​​​            58,325      L1-icache-load-misses                                         ( +-  3.85% )  (25.72%)
    ​​​​           259,162      r014c                                                         ( +-  2.77% )  (25.31%)
    ​​​​           351,523      r024c                                                         ( +-  4.43% )  (37.40%)
    
    ​​​​       0.277570860 seconds time elapsed                                          ( +-  1.61% )
    
    看到 r014c 的差異 代表有進行SW prefetch
    • gnuplot

      avx 的效能又比 sse prefetch 好 因為一次處理的資料量比較大

但有一個重點 avx 所用的 prefetch distance 應該要是 sse 版本的兩倍 也就是16才對
這裡比較一些不同的 distance 代入的結果

  • 結果
    • gnuplot

      執行時間還是 distance = 8 時比較好
    • perf
    ​​​​ Performance counter stats for './avx_prefetch_transpose 8' (20 runs):
    
    ​​​​         3,867,416      cache-misses              #   80.189 % of all cache refs      ( +-  2.16% )  (44.33%)
    ​​​​         4,822,896      cache-references                                              ( +-  2.20% )  (44.72%)
    ​​​​       684,115,466      cycles                                                        ( +-  2.98% )  (45.18%)
    ​​​​         7,658,761      L1-dcache-load-misses     #    1.90% of all L1-dcache hits    ( +-  1.46% )  (45.68%)
    ​​​​       403,675,699      L1-dcache-loads                                               ( +-  0.92% )  (40.26%)
    ​​​​       209,757,372      L1-dcache-stores                                              ( +-  0.89% )  (22.63%)
    ​​​​            60,160      L1-icache-load-misses                                         ( +-  8.71% )  (22.67%)
    ​​​​           284,360      r014c                                                         ( +-  5.87% )  (22.46%)
    ​​​​           419,681      r024c                                                         ( +- 10.79% )  (33.30%)
    
    ​​​​       0.285078156 seconds time elapsed                                          ( +-  3.82% )
    
    ​​​​ Performance counter stats for './avx_prefetch_transpose 16' (20 runs):
    
    ​​​​         3,839,351      cache-misses              #   58.430 % of all cache refs      ( +-  1.73% )  (44.75%)
    ​​​​         6,570,840      cache-references                                              ( +-  1.35% )  (45.41%)
    ​​​​       644,629,299      cycles                                                        ( +-  0.44% )  (45.90%)
    ​​​​         7,695,401      L1-dcache-load-misses     #    1.86% of all L1-dcache hits    ( +-  1.10% )  (46.16%)
    ​​​​       413,234,940      L1-dcache-loads                                               ( +-  0.37% )  (39.60%)
    ​​​​       213,451,682      L1-dcache-stores                                              ( +-  0.71% )  (22.13%)
    ​​​​            55,960      L1-icache-load-misses                                         ( +- 10.40% )  (22.69%)
    ​​​​           178,769      r014c                                                         ( +-  8.47% )  (22.50%)
    ​​​​           197,597      r024c                                                         ( +-  1.80% )  (33.55%)
    
    ​​​​       0.262698318 seconds time elapsed     
    
    1. cache misses 確實是 distance = 16 最低 但是是因為 cache refs 比別人多了很多 所以 miss rate 才低
    2. cycles 數 dis = 16 比 dis = 8 少了 40,000,000 次
    3. r014c r024c cache hit 次數都比 dis = 8 還少

Code 再進化

  • Use higher level macro to simplify avx code
    參考到類似作法 還在學習中
  • Use Stopwatch to record time
    將老師推薦的 Stopwatch 結合 clock_gettime 達到精準的時間測試

  1. kobeyu的筆記 ↩︎

  2. ierosodin的筆記 ↩︎