Try   HackMD

2017q1 Homework1 (phonebook)

contributed by < Hsiang >

作業解說 video

Reviewed by yangyang95

  • 文字訊息不要用圖片表示,以方便搜尋
  • Cache-misses 的比例 92.475 "%" ,百分比需明確標註
  • .gitignore 裡面可以寫成 *.txt *.png 來忽略特定檔案格式
  • commit 1d98e806cd66156ab0393942921262bfae052a2b
    筆記中應介紹三種 Hash function 是如何實作,之間的差異在哪
    這個 commit 也應考慮分成三個,一次一個實作,並配上適合的 message
  • 可以做圖比較各實作 append() findName() 時間差異,並放在最後,方便比較

目標

  • 熟悉 Perf
  • 優化 phonebook 效能
  • 學習關於Cache基礎

開發環境

作業系統:

uname -a
Linux ubuntu 4.8.0-39-generic #42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Cache 架構

⚡  lstopo
Machine (16GB)
  Package L#0 + L3 L#0 (6144KB)
    L2 L#0 (256KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#4)
    L2 L#1 (256KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
      PU L#2 (P#1)
      PU L#3 (P#5)
    L2 L#2 (256KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
      PU L#4 (P#2)
      PU L#5 (P#6)
    L2 L#3 (256KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
      PU L#6 (P#3)
      PU L#7 (P#7)
  HostBridge L#0
    PCI 8086:0166
      GPU L#0 "renderD128"
      GPU L#1 "card0"
      GPU L#2 "controlD64"
    PCIBridge
      PCI 14e4:1686
        Net L#3 "enp1s0f0"
    PCIBridge
      PCI 14e4:4331
        Net L#4 "wlp2s0"
    PCI 8086:1e03
      Block(Disk) L#5 "sda"
      Block(Disk) L#6 "sdb"
  • L1i Cache 32KB
  • L1d Cache 32KB
  • L2 Cache 256KB
  • L3 Cache 6144KB
Cache line size
⚡ getconf LEVEL1_DACAHE_LINESIZE
64

perf

設定一般user sudo sh -c 'echo 1 >/proc/sys/kernel/perf_event_paranoid'

問題分析

  • cache miss 是什麼?

效能分析-優化前

  • ./phonebook_orig
size of entry : 136 bytes
execution time of append() : 0.082711 sec
execution time of findName() : 0.006030 sec
  • Stat 分析
⚡ sudo perf stat -r 100 ./phonebook_orig

Performance counter stats for './phonebook_orig' (100 runs):

  62.434152  task-clock (msec)       # 0.996 CPUs utilized            ( +-  2.01% )
          0  context-switches        # 0.002 K/sec                    ( +- 25.73% )
          0  cpu-migrations          # 0.000 K/sec
     12,354  page-faults             # 0.198 M/sec                    ( +-  0.00% )
177,805,159  cycles                  # 2.848 GHz                      ( +-  0.69% )
 84,945,067  stalled-cycles-frontend # 47.77% frontend cycles idle    ( +-  1.33)
264,406,729  instructions            # 1.49  insn per cycle
                                     # 0.32  stalled cycles per insn  ( +-  0.02% )
 48,204,531  branches                # 772.086 M/sec                  ( +-  0.01% )
    923,397  branch-misses           # 1.92% of all branches          ( +-  0.93% )
0.062661753  seconds time elapsed                                     ( +-  2.00% )

由於0.996 CPUs utilized非常接近1屬CPU BOUND型

  • 針對 Event Types 分析
⚡  sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict" 取消禁用 Kenerl pointer

⚡  sudo perf stat -r 100 -e \
    cache-misses,cache-references, \
    L1-dcache-load-misses, \
    L1-dcache-store-misses, \
    L1-dcache-prefetch-misses, \
    L1-icache-load-misses, \
    instructions, \
    cycles \
    ./phonebook_orig
  Performance counter stats for './phonebook_orig' (100 runs):

  1,790,769   cache-misses # 92.475 % of all cache refs  ( +-  1.41% )  (61.16%)
  1,936,484   cache-references                           ( +-  1.32% )  (64.08%)
  2,679,276   L1-dcache-load-misses                      ( +-  1.14% )  (65.20%)
  1,000,437   L1-dcache-store-misses                     ( +-  0.70% )  (54.45%)
  799,336     L1-dcache-prefetch-misses                  ( +-  3.44% )  (53.35%)
  125,428     L1-icache-load-misses                      ( +-  1.45% )  (49.85%)
  257,663,444 instructions # 1.56  insn per cycle        ( +-  0.34% )  (61.42%)
  165,474,641 cycles                                     ( +-  0.36% )  (58.85%)
  0.095967552 seconds time elapsed                       ( +-  1.83% )

Cache-misses 的比例92.475

  • perf hotspot
⚡ sudo perf record -e cache-misses ./phonebook_orig
⚡ sudo perf report

Samples: 350  of event 'cache-misses', Event count (approx.): 1655901
Overhead  Command Shared  Object             Symbol
# red
  41.68%  phonebook_orig  libc-2.23.so       [.] __strcasecmp_l_avx
  35.11%  phonebook_orig  [kernel.kallsyms]  [k] clear_page_c_e
   5.39%  phonebook_orig  phonebook_orig     [.] findName
# green
   4.07%  phonebook_orig  [kernel.kallsyms]  [k] _raw_spin_lock
   2.51%  phonebook_orig  [kernel.kallsyms]  [k] get_mem_cgroup_from_mm
   2.09%  phonebook_orig  [kernel.kallsyms]  [k] copy_user_enhanced_fast_string
   1.97%  phonebook_orig  [kernel.kallsyms]  [k] get_page_from_freelist
   1.27%  phonebook_orig  [kernel.kallsyms]  [k] handle_mm_fault
   1.15%  phonebook_orig  [kernel.kallsyms]  [k] pte_lockptr.isra.15
   0.81%  phonebook_orig  [kernel.kallsyms]  [k] try_charge
   0.80%  phonebook_orig  [kernel.kallsyms]  [k] mem_cgroup_try_charge
   0.64%  phonebook_orig  libc-2.23.so       [.] __strcasecmp_avx
   0.64%  phonebook_orig  [kernel.kallsyms]  [k] unmap_page_range
   0.62%  phonebook_orig  [kernel.kallsyms]  [k] free_hot_cold_page_list
   0.58%  phonebook_orig  [kernel.kallsyms]  [k] free_pcppages_bulk
   0.57%  phonebook_orig  [kernel.kallsyms]  [k] __alloc_pages_nodemask
   0.11%  phonebook_orig  ld-2.23.so         [.] _dl_map_object_from_fd
# white
   0.00%  perf            [kernel.kallsyms]  [k] __perf_event_header__init_id
   0.00%  perf            [kernel.kallsyms]  [k] strlcpy
   0.00%  perf            [kernel.kallsyms]  [k] perf_event_addr_filters_exec

根據Hotspot分析出來 前三個function是造成cache-miss主要的原因:

  • __strcasecmp_l_avx -> int strcasecmp (const char *s1, const char *s2)
  • clear_page_c_e ->
  • findName -> entry *findName(char lastName[], entry *pHead)

由函數指出字串比對次數過多導致miss次數變多

改善方向

由上面的分析數據顯示以下:

  • 儲存資料的結構與演算法導致miss次數增加
  • 搜尋名字的演算法沒有效率

Optimization

L1d Cache size = 32KB,Cache line size = 64Bytes

Cache Line的個數
32 * 1024 / 64 = 512

每筆Entry有136Bytes,因此一次存入L1d的數量就只有
32 * 512 / 136 大約 120筆 Entry

根據大小改善的原始結構後

每筆Entry剩下24Bytes,因此一次存入L1d的數量增加為
32 * 512 / 24 大約 682筆 Entry

phonebook_orig.h
#define MAX_LAST_NAME_SIZE 16 typedef struct __PHONE_BOOK_ENTRY { struct __PHONE_BOOK_ENTRY *pNext; char lastName[MAX_LAST_NAME_SIZE]; }
Performance counter stats for './phonebook_opt' (100 runs):

216,188  cache-misses #   72.132 % of all cache refs      ( +-  0.48% )
299,709  cache-references                                 ( +-  0.37% )

0.061514855 seconds time elapsed

時間上的成長

size of entry : 136 bytes
execution time of append() : 0.084283 sec
execution time of findName() : 0.006502 sec

===========================================

size of entry : 24 bytes
execution time of append() : 0.068817 sec
execution time of findName() : 0.003831 sec

減少不必要的欄位空間,使cache-misses率下降了約20%,搜尋以及儲存資料的時間也有些微下降

Hash Function(BKDR)

整理了 Hash Function
先是建立了一張3571大小的TABLE,再不減少Entry的情況下進行分析

size of entry :136 bytes
execution time of append() : 0.110690 sec
execution time of findName() : 0.000001 sec

==============================================================

Performance counter stats for './phonebook_hash' (100 runs):

2,514,577   cache-misses  # 56.257 % of all cache refs          ( +-  0.27% )  (57.49%)
4,469,801   cache-references                                    ( +-  0.31% )  (58.22%)
5,879,673   L1-dcache-load-misses # 5.82% of all L1-dcache hits ( +-  0.25% )  (58.54%)
101,099,783 L1-dcache-loads                                     ( +-  0.60% )  (50.24%)
629,741     L1-dcache-prefetch-misses                           ( +-  0.99% )  (28.11%)
3,190,335   L1-dcache-store-misses                              ( +-  0.43% )  (29.31%)
130,865     L1-icache-load-misses                               ( +-  0.88% )  (43.36%)

0.164596361 seconds time elapsed

減少大小後

size of entry :32 bytes
execution time of append() : 0.099975 sec
execution time of findName() : 0.000001 sec

==============================================================

Performance counter stats for './phonebook_hash' (100 runs):

1,701,515  cache-misses # 45.586 % of all cache refs           ( +-  0.32% )  (57.57%)
3,732,515  cache-references                                    ( +-  0.35% )  (58.29%)
5,336,750  L1-dcache-load-misses # 5.56% of all L1-dcache hits ( +-  0.40% )  (58.54%)
95,991,463 L1-dcache-loads                                     ( +-  0.87% )  (50.07%)
482,912    L1-dcache-prefetch-misses                           ( +-  0.77% )  (28.06%)
2,756,774  L1-dcache-store-misses                              ( +-  0.80% )  (29.43%)
70,032     L1-icache-load-misses                               ( +-  0.79% )  (43.42%)

0.162973286 seconds time elapsed

由數據分析換上Hash Function後
Cache-miss的比例降低不少 92.475 -> 56.257
append時間多出0.03 sec
findName時間則下降至0.000001 sec
然而減少Hash大小也減少的些微Cache-miss的比例 56.257 -> 45.586
append時間下降 0.011 sec

Cache-misses 45.586% 依然偏高


學習ierosodin的分析,也著手分析大小對於分佈的影響



(mod 9997)

size of entry :32 bytes
execution time of append() : 0.101987 sec
execution time of findName() : 0.000000 sec

==============================================================

 Performance counter stats for './phonebook_hash' (100 runs):

1,676,158  cache-misses  # 43.189 % of all cache refs            ( +-  0.29% )  (57.55%)
3,880,979  cache-references                                      ( +-  0.34% )  (58.36%)
5,470,038  L1-dcache-load-misses  # 5.51% of all L1-dcache hits  ( +-  0.42% )  (58.80%)
99,279,186 L1-dcache-loads                                       ( +-  0.71% )  (50.62%)
483,688    L1-dcache-prefetch-misses                             ( +-  0.84% )  (27.92%)
2,852,689  L1-dcache-store-misses                                ( +-  0.80% )  (29.11%)
75,499     L1-icache-load-misses                                 ( +-  0.86% )  (43.37%)

0.167409641 seconds time elapsed

根據結果
當把TABLE_SIZE設成9997時,search時間下降到小於 10-6 sec
且append時間只增加了0.002 sec,Cache-misses 也多下降了 2%
也驗證了分佈越均勻確實可以下降Cache-misses的比例


  • perf hotspot
Samples: 695  of event 'cache-misses', Event count (approx.): 1660031
Overhead  Command         Shared Object      Symbol
  49.30%  phonebook_hash  libc-2.23.so       [.] malloc_consolidate
  22.13%  phonebook_hash  libc-2.23.so       [.] _int_free
  12.29%  phonebook_hash  [kernel.kallsyms]  [k] clear_page_c_e

根據顯示 malloc的部分還是造成了許多Cache-misses的情況,也因此在釋放記憶體的時間增加了Caches misses的比例。

//main.c#55 entry *pHead, *e[TABLE_SIZE]; pHead = (entry *) malloc(sizeof(entry) * TABLE_SIZE); printf("size of entry :%lu bytes\n", sizeof(entry)); for (i = 0; i < TABLE_SIZE; i++) { e[i] = &pHead[i]; e[i]->pNext = NULL; } i = 0;

著手改善的方向memory pool

參考資料

網路資源
書籍
  • [書籍-計算機組織與設計(軟硬體介面) 5-3章節]
共筆