Try   HackMD

2016q3 Homework1 (phonebook)

contributed by <mingnus>

tags: ncku sysprog perf linux phonebook

環境設定

  • 作業系統: Ubuntu 16.04.1, kernel 4.4.0-38
    硬體: Intel Celeron N3150

  • 安裝作業所需套件: gnuplot-x11, linux-tools-common, linux-tools-4.4.0-38-generic

  • [optional] 調整系統locale (例如LC_NUMERIC會影響perf stat千位數逗號的位置)
    http://www.davidpai.tw/ubuntu/2011/ubuntu-set-locale/
    (Ubuntu可從UI設定: System Settings > Language Support > Regional Formats; 設定後重開機)

  • 調整 CPU 電源設定, 以測得 CPU 最大效能

    • GRUB: intel_idle.max_cstate, processor.max_cstate
    • sysfs:
      cat /sys/devices/system/cpu/cpu<N>/cpufreq/scaling_available_governors
      echo performance > /sys/devices/system/cpu/cpu<N>/cpufreq/scaling_governors

perf Tryout

設定

http://wiki.csie.ncku.edu.tw/embedded/perf-tutorial

$ sudo sh -c "echo 0 > /proc/sys/kernel/perf_event_paranoid"
$ sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"

統計performance counter資訊

## perf stat [-r <repeat_count>] [-e <events>] <command>
$ perf stat --repeat 100 \
-e cache-misses,cache-references,instructions,cycles \
./phonebook_orig

sample output:

 Performance counter stats for './phonebook_orig' (100 runs):

         3,422,897      cache-misses              #   29.110 % of all cache refs      ( +-  0.18% )  (74.90%)
        11,758,558      cache-references                                              ( +-  0.27% )  (50.51%)
       265,764,121      instructions              #    0.63  insns per cycle          ( +-  0.07% )  (75.42%)
       424,232,571      cycles                                                        ( +-  0.08% )  (75.24%)

       0.212240981 seconds time elapsed                                          ( +-  0.25% )

行末括號內的數字是events multiplex percentages, 當取樣的PMU event類型超過PMU個數時就會啟用此功能: https://perf.wiki.kernel.org/index.php/Tutorial#multiplexing_and_scaling_events

收集performance counter資訊

$ sudo perf record -e cache-misses,cache-references,instructions,cycles ./phonebook_orig

顯示performance counters資訊

$ sudo perf report

Sample view:

  • self-overhead: function 本身的 sample counts, 不含 sample counts in child functions。所有的 self-overhead sum一定是100
  • children overhead: function本身 + child functions的sample count。加總children overhead沒意義, 但從child overhead可看出哪個function最花時間。
    https://perf.wiki.kernel.org/index.php/Tutorial#Overhead_calculation

    "It might be confusing that the sum of all the 'children' overhead values exceeds 100% since each of them is already an accumulationof 'self' overhead of its child functions. But with this enabled, users can find which function has the most overhead even if samples are spread over the children.”

Operations

  • enter:
    • without -g: choose operations on the symbol
    • with -g: expand/fold the callees
  • t: toggle thread only
  • a: enter annotation view
  • s: switch to different perf.data in the current directory

Annotation View:

Memory events

https://software.intel.com/en-us/forums/software-tuning-performance-optimization-platform-monitoring/topic/557604

實驗: 觀察bandwidth運作期間的event count。該程式是將一個heap buffer(連續空間)內的資料反覆讀進register來測試memory throughput。以下觀察循序讀取時的event count。
(注意: 最好不要用fastest參數執行bandwidth,否則本程式觸發的event count不夠多, 不易和背景程式的events區隔)

  • 連續存取小於L1 cache size的buffer: 大多是L1-dcache-loads events, 極少觸發cache-misses, L1-dcache-load-misses, L1-dcache-prefetches, cache-misses等events
  • 連續存取小於L2 cache size的buffer: 產生大量L1-dcache-load-misses及L1-dcache-prefetches events,數量接近L1-dcache-loads
  • 連續存取大於L2 cache size的buffer: 產生cache-misses events
  • CPU處理64-byte aligned位址時較容易發生cache相關events (包括cache miss), 行為符合cache line大小

Cache misses in _Reader.L2():

       │    Disassembly of section .text:
       │
       │    0000000000401588 <_Reader.L2>:
       │    _Reader.L2():
  1.87 │ 0:   mov    (%r10),%rax
  0.34 │      mov    0x8(%r10),%rax
  0.68 │      mov    0x10(%r10),%rax
  1.92 │      mov    0x18(%r10),%rax
  4.58 │      mov    0x20(%r10),%rax
  4.30 │      mov    0x28(%r10),%rax
  1.47 │      mov    0x30(%r10),%rax
  2.32 │      mov    0x38(%r10),%rax
  1.81 │      mov    0x40(%r10),%rax
  2.38 │      mov    0x48(%r10),%rax
  2.43 │      mov    0x50(%r10),%rax
  1.81 │      mov    0x58(%r10),%rax
* 9.50 │      mov    0x60(%r10),%rax
*10.46 │      mov    0x68(%r10),%rax
  2.04 │      mov    0x70(%r10),%rax
  2.43 │      mov    0x78(%r10),%rax
  1.30 │      mov    0x80(%r10),%rax
  1.70 │      mov    0x88(%r10),%rax
  2.32 │      mov    0x90(%r10),%rax
  1.36 │      mov    0x98(%r10),%rax
* 6.90 │      mov    0xa0(%r10),%rax
* 6.50 │      mov    0xa8(%r10),%rax
  1.47 │      mov    0xb0(%r10),%rax
  2.21 │      mov    0xb8(%r10),%rax
  2.71 │      mov    0xc0(%r10),%rax
  2.09 │      mov    0xc8(%r10),%rax
  0.96 │      mov    0xd0(%r10),%rax
  0.79 │      mov    0xd8(%r10),%rax
  4.47 │      mov    0xe0(%r10),%rax
* 5.20 │      mov    0xe8(%r10),%rax
  2.04 │      mov    0xf0(%r10),%rax
  3.00 │      mov    0xf8(%r10),%rax
  2.94 │      add    $0x100,%r10
  0.51 │      cmp    %rsi,%r10
  1.13 │    ↑ jb     0
       │      dec    %rdx
  0.06 │    ↓ jne    fffffffffffffffd
       │      pop    %r10
       │    ← retq

see also:

方法評估

  1. 英文last name字首很少重覆, 用字元當trie的key會不會很浪費空間? 且搜尋次數為字串長度
  2. Binary search tree或rbtree對少量資料表現還不錯

Experiments

Baseline

本程式是連續呼叫malloc()建立鏈結串列, 所有的串列節點理論上是線性分布於heap空間。但因struct __PHONE_BOOK_ENTRY大小超過cache line size (64 bytes),所以每當存取下一個串列節點的姓名欄位時, 必然引發L1-dcache-loads events, 也容易產生cache miss。
(問題: memory access pattern與cache prefetch的資料量是否有關? 我們現在只存取每個節點開頭的16 bytes lastName, CPU是否會prefetch超過64 byte的資料?)

perf output:

 Performance counter stats for './phonebook_orig' (100 runs):

         1,326,960      L1-dcache-load-misses                                         ( +-  0.37% )  (59.37%)
         3,344,492      cache-misses              #   29.130 % of all cache refs      ( +-  0.29% )  (41.10%)
        11,481,124      cache-references                                              ( +-  0.26% )  (41.32%)
       264,859,089      instructions              #    0.62  insns per cycle          ( +-  0.11% )  (60.90%)
       425,551,760      cycles                                                        ( +-  0.09% )  (59.41%)

       0.212769801 seconds time elapsed                                          ( +-  0.20% )

實驗(1): Reduce structure size

加強資料的special locality: 將struct __PHONE_BOOK_ENTRY內非鍵值欄位移出,結構縮小為32 bytes。

perf output:

 Performance counter stats for './phonebook_opt' (100 runs):

           529,675      L1-dcache-load-misses                                         ( +-  1.23% )  (58.87%)
         1,217,650      cache-misses              #   27.935 % of all cache refs      ( +-  0.63% )  (40.89%)
         4,358,932      cache-references                                              ( +-  0.39% )  (41.92%)
       246,162,656      instructions              #    0.70  insns per cycle          ( +-  0.14% )  (61.60%)
       349,640,074      cycles                                                        ( +-  0.08% )  (59.62%)

       0.175088380 seconds time elapsed                                          ( +-  0.24% )

實驗(2): Binary search tree

基於實驗(1), 再建立binary search tree以加快搜尋速度: 以額外函式將sorted linked list轉為binary search tree。
struct __PHONE_BOOK_ENTRY裡新增兩個指向左右子樹的指標,struct size增為48 bytes。

perf output:

 Performance counter stats for './phonebook_opt' (100 runs):

           164,926      L1-dcache-load-misses                                         ( +-  1.69% )  (59.74%)
           813,046      cache-misses              #   19.045 % of all cache refs      ( +-  0.86% )  (40.67%)
         4,269,014      cache-references                                              ( +-  0.58% )  (41.01%)
       206,653,954      instructions              #    0.73  insns per cycle          ( +-  0.16% )  (61.62%)
       283,291,500      cycles                                                        ( +-  0.10% )  (60.09%)

       0.143766307 seconds time elapsed                                          ( +-  0.34% )

Result

測試前均釋放page caches

$ echo 3 > /proc/sys/vm/drop_caches

值得注意的是,如果把struct size改為和cache line一樣大小的話 (64 bytes),效能仍較原始版本佳 (原因待查)

program output:

size of entry : 64 bytes
execution time of append() : 0.127740 sec
execution time of findName() : 0.017339 sec

perf output:

 Performance counter stats for './phonebook_opt' (100 runs):

           865,688      L1-dcache-load-misses                                         ( +-  0.29% )  (59.67%)
         1,910,430      cache-misses              #   27.976 % of all cache refs      ( +-  0.50% )  (41.08%)
         6,828,855      cache-references                                              ( +-  0.23% )  (40.91%)
       253,911,736      instructions              #    0.68  insns per cycle          ( +-  0.14% )  (60.72%)
       373,165,944      cycles                                                        ( +-  0.09% )  (59.44%)

       0.186791556 seconds time elapsed                                          ( +-  0.21% )

其他可改進處

  1. main()裡面的字串處理方式 by c14006078
    https://hackmd.io/s/BkN1JNQp
    main()是用fgets()取出字串, 但fgets()會保留行末的換行字元,所以main()必須將其刪除。這邊是以while-loop得知字串長度, 效率不佳。

其他

Macro expansion (原因未查明)

main.c line 69: assert()參數IMPL沒被展開

Example: test.c

#include <stdio.h>
#include <assert.h>
#define MY_STR "my string"
int main() {
	assert(0 && "test: " MY_STR);
	return 0;
}

輸出: macro MY_STR 沒被展開

    test: test.c:7: main: Assertion `0 && "test: " MY_STR' failed.

檢查 preprocessor

$ gcc -E -DM test.c
#define assert(expr) ((expr) ? __ASSERT_VOID_CAST (0) : __assert_fail (#expr, __FILE__, __LINE__, __ASSERT_FUNCTION))

$ gcc -E -P test.c
int main() {
 ((0 && "test: " "my string") ? (void) (0) : __assert_fail ("0 && \"test: \" MY_STR", "test.c", 7, __PRETTY_FUNCTION__));
}

可以看到傳入__assert_fail()的IMPL沒被展開

其他問題

  • 如何用perf讀取AMD CPU的RAW event counters ?
  • 為何用taskset綁定CPU core後, 測試數據較佳? (cache-references & cache-misses約下降5%)
  • perf sample rate (-F|frequency) 的意義, 與sampled event count的關係
  • 為何kernel會自動drop sample rate?
    e.g., kernel log: perf interrupt took too long (10034 > 10000), lowering kernel.perf_event_max_sample_rate to 12500
    https://bbs.archlinux.org/viewtopic.php?id=170471
  • 找個記錄memory address access pattern的工具