2017q3 Homework1 (phonebook)

contributed by < maskashura >

Perf 設定筆記

kernel.perf_event_paranoid權限設定:

在開始執行效能測試前,先注意一些設定權限是否打開:

kernel.perf_event_paranoid 是用來決定你在沒有 root 權限下 (Normal User) 使用 perf 時，你可以取得哪些 event data。預設值是 1 ，你可以輸入
cat /proc/sys/kernel/perf_event_paranoid

一共有四種權限值:

2 : 不允許任何量測。但部份用來查看或分析已存在的紀錄的指令仍可使用，如 perf ls、perf report、perf timechart、 perf trace。

1 : 不允許 CPU events data。但可以使用 perf stat、perf record 並取得 Kernel profiling data。

0 : 不允許 raw tracepoint access。但可以使用 perf stat、perf record 並取得 CPU events data。

-1: 權限全開。

將kernel.perf_event_paranoid權限設成1

sudo sh -c " echo 1 > /proc/sys/kernel/perf_event_paranoid"

取消kernel pointer 禁用:

如欲檢測 cache miss event ，需要先取消 kernel pointer 的禁用。
sudo sh -c " echo 0 > /proc/sys/kernel/kptr_restrict"

perf stat指令說明:

perf stat –repeat 5 -e cache-misses,cache-references,instructions,cycles ./filename

–repeat <n>或是-r <n> 可以重複執行 n 次該程序，並顯示每個 event 的變化區間。
cache-misses,cache-references和 instructions,cycles類似這種成對的 event，若同時出現 perf 會很貼心幫你計算比例。

perf top指令說明:

最後要觀察其他 event ( 預設 cycles ) 和指定取樣頻率 ( 預設每秒4000次 ) :
perf top -e cache-misses -c 5000

perf 原理和實務 (共筆)

GitHub設定

綁訂機器的 SSH key

產生 SSH 登入用的金鑰，可以使用 ssh-keygen 這個指令。在建立金鑰之前，要先建立 ~/.ssh 這個目錄，並設定正確的權限：

mkdir -p ~/.ssh
chmod 700 ~/.ssh

然後以 ssh-keygen 產生金鑰：

ssh-keygen -t rsa -C "your_email@example.com"

Your identification has been saved in /root/.ssh/id_rsa.
Your public key has been saved in /root/.ssh/id_rsa.pub.

查看ssh key
cat ~/root/.ssh/id_rsa.pub

將ssh key加入到 Github 網站,驗證有沒有綁定

ssh -T git@github.com

會得到下列訊息表示成功

Hi maskashura! You've successfully authenticated, but GitHub does not provide shell access.

GitHub指令

同步

將本機 repository 和遠端 Github repository 同步

git remote add origin git@github.com:your_account/your_repository.git
git push -u origin master
再輸入
git push

或參考git指引:

echo "# system.software2017" >> README.md
git init
git add README.md
git commit -m "first commit"
git remote add origin git@github.com:maskashura/system.software2017.git
git push -u origin master

clone

將遠端的 repository 複製一份到本機:

git clone git@github.com:Username/repository.git

如果沒有寫入權限 (Collaborators) 的話，因為這個專案是公開的，所以你還是可以用 Git 通訊協定 clone 下來：

git clone git://github.com/Username/repository.git

若因遭逢防火牆問題 (如成功大學校園環境)，請改用 HTTPS 通訊協定：

git clone https://github.com/Username/repository.git

Pull

從遠端更新
git pull
或
git pull origin master

Push

git push
或
git push origin master

參考資料
GitHub 設定指引
 SSH 公開金鑰認證

取得電腦環境

OS使用Lubuntu 17.04 , 64位元版本

取得cache大小, 輸入 lscpu 可得:

Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 4
On-line CPU(s) list: 0-3
Thread(s) per core: 2
Core(s) per socket: 2
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 42
Model name: Intel® Core™ i3-2350M CPU @ 2.30GHz
Stepping: 7
CPU MHz: 799.890
CPU max MHz: 2300.0000
CPU min MHz: 800.0000
BogoMIPS: 4589.87
Virtualization: VT-x
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 3072K
NUMA node0 CPU(s): 0-3

實作驗證

由於不太熟悉整體效能測試流程，先參考了0140454同學共筆, 以其流程做修改測試

先記錄下修改前, 執行phonebook_orig結果

./phonebook_orig

size of entry : 136 bytes
execution time of append() : 0.070063 sec
execution time of findName() : 0.006372 sec

再執行make cache-test
make cache-test 內容:

cache-test: $(EXEC)
perf stat --repeat 100 
-e cache-misses,cache-references,instructions,cycles 
./phonebook_orig
perf stat --repeat 100 
-e cache-misses,cache-references,instructions,cycles

執行 make cache-test 後可得下列結果, 我們可看到 cache miss 高達91%

 Performance counter stats for './phonebook_orig' (100 runs):

         2,040,979      cache-misses              #   91.415 % of all cache refs      ( +-  0.06% )
         2,232,662      cache-references                                              ( +-  0.07% )
       261,727,331      instructions              #    1.33  insn per cycle                                              ( +-  0.02% )
       196,762,949      cycles                                                        ( +-  0.11% )

       0.088329791 seconds time elapsed                                          ( +-  0.26% )

執行 ./phonebook_org ; sudo perf top -p $! , 觀察程式中的效能瓶頸

(利用"$!", 可取得前面執行的程序pid )























Samples: 89  of event 'cycles', Event count (approx.): 43058077                 
Overhead  Shared Object     Symbol                                             ◆
  26.59%  phonebook_orig    [.] findName                                       ▒
   9.27%  libc-2.24.so      [.] 0x000000000014d719                             ▒
   7.92%  libc-2.24.so      [.] 0x000000000014ecf8                             ▒
   7.18%  libc-2.24.so      [.] 0x000000000014ed06                             ▒
   6.04%  libc-2.24.so      [.] 0x000000000014ecfc                             ▒
   4.85%  libc-2.24.so      [.] 0x000000000014ecf4                             ▒
   4.03%  [kernel]          [k] unmap_page_range                               ▒
   3.27%  libc-2.24.so      [.] 0x000000000014d756                             ▒
   3.26%  libc-2.24.so      [.] 0x000000000014ecf0                             ▒
   3.03%  libc-2.24.so      [.] 0x000000000014d6e5                             ▒
   3.02%  [kernel]          [k] free_hot_cold_page                             ▒
   3.00%  [kernel]          [k] release_pages                                  ▒
   2.12%  libc-2.24.so      [.] 0x000000000014d711                             ▒
   2.02%  libc-2.24.so      [.] 0x000000000014d742                             ▒
   2.02%  [kernel]          [k] free_pages_and_swap_cache                      ▒
   1.99%  [kernel]          [k] free_pcppages_bulk                             ▒
   1.86%  phonebook_orig    [.] append                                         ▒
   1.30%  libc-2.24.so      [.] 0x000000000014d6f7                             ▒
   1.06%  libc-2.24.so      [.] 0x000000000014ed0b                             ▒
   1.01%  libc-2.24.so      [.] 0x000000000014d74e                             ▒
   1.01%  libc-2.24.so      [.] 0x000000000014ed09

其中 findName 佔25% , 而 append 只佔1.8%

思考Cache miss

詳細閱讀 programming small 裡頭的電話簿程式，研究降低 cache miss 的方法

打開 phonebook_orig.h ,我們可看到PHONE_BOOK_ENTRY的結構
















#define MAX_LAST_NAME_SIZE 16

/* original version */
typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    char firstName[16];
    char email[16];
    char phone[10];
    char cell[10];
    char addr1[16];
    char addr2[16];
    char city[16];
    char state[2];
    char zip[5];
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

而 phonebook.c 中則描述 findName() 及 append()





















/* original version */
entry *findName(char lastName[], entry *pHead)
{
    while (pHead != NULL) {
        if (strcasecmp(lastName, pHead->lastName) == 0)
            return pHead;
        pHead = pHead->pNext;
    }
    return NULL;
}

entry *append(char lastName[], entry *e)
{
    /* allocate memory for the new entry and put lastName */
    e->pNext = (entry *) malloc(sizeof(entry));
    e = e->pNext;
    strcpy(e->lastName, lastName);
    e->pNext = NULL;

    return e;
}

1. 使用空間較小的 struct

以我的筆電為例L1 Cach 為 32KB = 32 * 1024 byte ，而 PhoneBook size = 136 bytes
以原先設計 32 * 1024 / (136*8) = 30.12 (大約只能存 30 筆左右)

剛看到題目時很傻的要直接簡化structure大小(想針對phone,cell,zip…etc.,及一些固定內容的資料state & city 做簡化增加空間利用), 但題目再往下拉就發現走錯路了…且降低的cache miss少的可憐

故我採用programming small中所說的，因為要求在於有沒有找到 last name , 而 struct 的其他資訊並無參考到，故我們可以先新增一個 struct 針對 last name , 再加上2指標 , 一指向擁有完整訊息的 struct "PHONE_BOOK_TOTAL" 另一指標為筆資料位址, 大小剛好是提示的32byte, 2 * 1024 / (32 * 8) = 128 筆 , 可以降低cache miss

新建struct "PHONEBOOK_LNAME" 並重新連結指標指向原 struct "PHONE_BOOK_ENTRY"





typedef struct __PHONE_BOOK_ENTRY {
    char lastName[MAX_LAST_NAME_SIZE];
    struct __PHONE_BOOK_TOTAL *total_data;
    struct __PHONE_BOOK_ENTRY *pNext;
} entry;

執行 make clean; make run_opt 檢視修改後結果

size of entry : 32 bytes
execution time of append() : 0.053531 sec
execution time of findName() : 0.003749 sec

整理最佳化前後的時間比較如下:

	Orig	Opt
size of entry	128 byte	36 byte
append()	0.070063 sec	0.053531 sec
findName()	0.006372 sec	0.003749 sec
再經過修改後空間較小的 struct , findName() 時間從0.0063降為0.0037sec , 而append() 由0.638降至0.055sec

執行 make plot 產生效能比較圖如下所示

再來檢查 cache miss 是否有改善, 輸入make cache-test , 可以發現cache miss從原本的91.745%降到了62.092% ，提昇了29%的效能


 Performance counter stats for './phonebook_orig' (100 runs):

         2,001,838      cache-misses              #   91.745 % of all cache refs      ( +-  0.05% )
         2,181,958      cache-references                                              ( +-  0.06% )
       262,277,307      instructions              #    1.36  insn per cycle                                              ( +-  0.02% )
       192,614,661      cycles                                                        ( +-  0.10% )

       0.087413542 seconds time elapsed                                          ( +-  0.29% )

 Performance counter stats for './phonebook_opt' (100 runs):

           346,679      cache-misses              #   62.092 % of all cache refs      ( +-  0.08% )
           558,328      cache-references                                              ( +-  0.22% )
       243,357,563      instructions              #    1.64  insn per cycle                                              ( +-  0.02% )
       148,287,579      cycles                                                        ( +-  0.12% )

       0.068586419 seconds time elapsed                                          ( +-  0.34% )

2. 使用 hash function 改善查詢速度

由 phone_orig.c 中可發現原先的 search 是採循序分析一一比對去找到目標，其時間複雜度為

O_{(n)}

, 在找大量資料時相當浪費時間，故我們改用HASH法, 找到和 phonebook 相關的 key並建立對應的 hash function 增進尋速度。

說到 phonebook 查詢 , 一般人都會採用字母先做篩選, 故我想先用字母做為 key value , 再用 linked list 去解決 collision 問題在此有想到一個問題，由於26個字母下 collision 的名字太多了, 有可能造成搜尋瓶頸, 不過目前想先把以字母為 key value 的搜尋完成再來探討有沒有更佳的 hash function.

在修改完程式並將三次測試結果圖示顯示後可以發現，加入字母為 key value 的 hash function 後findname時間明顯下降 (0.0037->0.00002 sec) , 大大降低了搜尋時間

參考資料

0140454同學共筆
benson326同學共筆
programming small
雜湊表(wikipedia)
圖解資料結構-使用c語言作者:胡昭明