2017q1 Homework1 (phonebook)

contributed by <genehuang0011>

進度要加快摟~課程助教
好的！俊錚

開發環境

CPU:

























$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
Thread(s) per core:    2
Core(s) per socket:    2
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 61
Model name:            Intel(R) Core(TM) i5-5257U CPU @ 2.70GHz
Stepping:              4
CPU MHz:               2700.000
CPU max MHz:           3100.0000
CPU min MHz:           500.0000
BogoMIPS:              5399.89
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              3072K
NUMA node0 CPU(s):     0-3

Kernal:



$ name -a
Linux geneh-MacBookPro 4.8.0-39-generic 
#42~16.04.1-Ubuntu SMP Mon Feb 20 15:06:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

工具熟悉

Perf

首先依照 perf原理與實務步驟指示裝上 perf，在 root 輸入 $ perf top 測試其功能時，出現以下錯誤畫面：

目前的權限值是 3 ??? (上面權限值不是只有 -1 到 2 嘛奇怪)
因此必須更改權限值，輸入指令

$ sudo sh -c "echo -1 > /proc/sys/kernel/perf_event_paranoid"

接著再試一次會出現：

因此須輸入以下指令，以解決 kernel pointer 的禁用。

$ sudo sh -c "echo 0 > /proc/sys/kernel/kptr_restrict"

常用的功能有：

list :列出所以可觸發的 event
top:可看出各函式在某個 event 上的熱點，可看出是哪個函式拖垮效能
stat:針對特定 event 做效能檢查

gnuplot

在 gnuplot語法解說和示範一文中，可以了解到這是一個很方便且功能強大的資料視覺化工具，有很多參數可以設定，感覺要用 gnuplot 畫出一個清楚易懂的好圖表也是一門學問阿！

astyle

一個很強大的工具，可以幫忙檢查 coding style 是否一致。
當檢查出格式不一致時，可輸入下列指令將檔案格式化。

$ astyle --style=kr --indent=spaces=4 --indent-switches --suffix=none *.[ch]

效能分析

首先，使用剛裝好的工具來觀察未優化的 phonebook 程式。

程式碼分析

phonebook_orig.c




















entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

perf top 熱點分析














Overhead  Shared Object   Symbol
  52.84%  libc-2.23.so    [.] __strcasecmp_l_avx
   9.39%  phonebook_orig  [.] findName
   7.68%  [kernel]        [k] unmap_page_range
   6.89%  libc-2.23.so    [.] __strcasecmp_avx
   4.28%  [kernel]        [k] page_remove_rmap
   3.97%  [kernel]        [k] __tlb_remove_page_size.part.60
   3.85%  [kernel]        [k] __mod_node_page_state
   3.74%  [kernel]        [k] __mod_zone_page_state
   3.65%  [kernel]        [k] uncharge_list
   3.50%  [kernel]        [k] free_pcppages_bulk
   0.20%  [kernel]        [k] perf_pmu_enable.part.93
   0.01%  [kernel]        [k] native_sched_clock
   0.00%  [kernel]        [k] perf_event_nmi_handler

make cache-test

$ perf stat --repeat 100 -e cache-misses,cache-references,instructions,cycles ./phonebook_orig

 Performance counter stats for './phonebook_orig' (100 runs):

         3,404,315      cache-misses              #   92.708 % of all cache refs      ( +-  0.04% )
         3,672,081      cache-references                                              ( +-  0.14% )
       262,497,727      instructions              #    1.43  insn per cycle           ( +-  0.02% )
       183,422,609      cycles                                                        ( +-  0.22% )

       0.059987389 seconds time elapsed                                          ( +-  0.22% )

make run

接著，將 cache 清空，並使用 watch 指令觀察執行時間。

$ watch -d -t "./phonebook_orig && echo 1 | sudo tee /proc/sys/vm/drop_caches"

由於我們對此次作業 cache misses 很在意，故要排除系統 cache 的影響，在每次執行之前手動藉由echo 1 > /proc/sys/vm/drop_caches來釋放 cache。

Freeing the page cache:
echo 1 > /proc/sys/vm/drop_caches
Free dentries and inodes:
echo 2 > /proc/sys/vm/drop_caches
Free the page cache, dentries and the inodes:
echo 3 > /proc/sys/vm/drop_caches

size of entry : 136 bytes
execution time of append() : 0.066283 sec
execution time of findName() : 0.004928 sec

使用 gnuplot 練習做一張圖，並作為之後的對照組。

小結

由熱點分析結果以及圖表，可以很清楚發現，在 findName() 效能的瓶頸主要發生在資料的字串比對上。
透過 perf cache-test 結果，可發現 cache-misses 高達 92.708% 。

預計解決方向

資料結構選擇上，因為原始使用 link-list 建構，字串必須「逐一」比對導致效率不彰，可考慮樹狀結構或先將其分類，以改善字串比對上的效能瓶頸。
縮小 entry-size 使得符合 cache 大小，以降低 cache-misses

參考資料

strcasecmp(3) - Linux man page