2017q1 Homework4 (phonebook-concurrent)

contributed by <baimao8437>

認識 Concurrent (並行)

工作可拆分成「獨立執行」的部份(decomposition), 這樣「可以」讓很多事情一起做, 但是「不一定」要真的同時做。
Concurrency 可能會用到 parallelism，但不一定要用 parallelism 才能實現 concurrency。

這兩句話很簡單的可以理解並行與平行的差別

背景知識^[1]

Evaluation
- value computations:對一串運算式計算的結果
- side effect:也就是對物件狀態的修改
Sequenced-Before
對同一個thread下，求值順序關係的描述

Happens-Before
程式碼「效果」的呈現順序舉例：

    A = B + 1;              // (1)
    B = 1;                  // (2)
```
Compiler 最佳化編譯後組語為:
```
movl    B(%rip), %eax  
movl    $1, B(%rip)     //(2) finish
addl    $1, %eax
movl    %eax, A(%rip)   //(1) finish
```
雖然後者先結束 但看起來效果像先 1 再 2 即符合 Happens-Before

Synchronized-with
就是跨 thread 版本的happens-before
ps.Sequenced-before 其實就是同一個 thread 內的 happens-before
Sequential Consistency
- 對於每個獨立的處理單元，執行時都維持程式的順序(Program Order)
- 整個程式以某種順序在所有處理器上執行
Lock-Free
還是有lock 但不會影響到其他執行續的運行(non-blocking)，不一定比非lock-free快
Coroutines
讓原本循序執行的陳述指令，得以做出交錯執行的結果，由 programmer 控制

Code Refactoring

「在不改變軟體的外在行為之下，改善既有軟體的內部設計」

重構的主要目的就是為了提升可讀性，以及為了日後有了需求的變化時，可以更容易因應修改或是擴充。

一些常見的「Bad smell」^[2]

Duplicated Code
在作業中，常因為要測試很多種方法，而使某些程式碼不斷的重複，可使用:
- Extract method
- Form Template Method
Long Method
Programming people have realized that the longer a procedure is, the more difficult it is to understand.
- 有可另外提出的程式碼: Extract method
- 太多的 temporary variables: Replace Temp with Query
Comments
需要comment表示這段程式碼充滿妨礙閱讀的 bad smell，先 refactoring 後，往往就不需要 comment 了
- 需要註解一段程式碼: Extract Method
- 需要註解的程式碼已是 method: Rename Method
- 需要註解執行程式的必要條件: Introduce assertion(老師常用！)

實驗實作

開發環境

baimao@baimao-Aspire-V5-573G:~$ lscpu
Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              69
Model name:            Intel(R) Core(TM) i5-4200U CPU @ 1.60GHz
製程：              1
CPU MHz：             1711.242
CPU max MHz:           2600.0000
CPU min MHz:           800.0000
BogoMIPS:              4589.38
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K
NUMA node0 CPU(s):     0-3

原始效能分析

記得 echo 3 | sudo tee /proc/sys/vm/drop_caches 清空 cache

$ ./phonebook_orig

size of entry : 136 bytes
execution time of append() : 0.111719 sec
execution time of findName() : 0.005537 sec

perf orig

 Performance counter stats for './phonebook_orig' (100 runs):

         1,202,610      cache-misses              #   78.679 % of all cache refs      ( +-  0.49% )
         1,528,504      cache-references                                              ( +-  0.19% )
       320,439,956      instructions              #    1.48  insn per cycle           ( +-  0.02% )
       216,931,003      cycles                                                        ( +-  0.15% )

       0.087285532 seconds time elapsed                                          ( +-  0.27% )

./phonebook_opt 這裡預設的 thread_num = 4

orginal file size = 3206080
size of entry : 24 bytes
execution time of append() : 0.012572 sec
execution time of findName() : 0.003665 sec

perf opt THREAD=4

 Performance counter stats for './phonebook_opt' (100 runs):

           387,281      cache-misses              #   23.459 % of all cache refs      ( +-  0.47% )
         1,650,873      cache-references                                              ( +-  0.71% )
       247,715,431      instructions              #    1.43  insn per cycle           ( +-  0.09% )
       173,636,294      cycles                                                        ( +-  0.64% )

       0.091929529 seconds time elapsed                                          ( +-  1.40% )

gnuplot

分析

opt 版本主要實作了以下去優化 append 效能:

text align
- 將所有資料的長度對齊輸出至 align.txt
- 不夠長補'\0' 太長截斷
- fsize 會回傳整的檔案的大小(bytes)
entry_pool
- 依據 text align 的 fsize 去一次 malloc 全部 entry
mmap
- 將 align.txt 映射至虛擬記憶體 VMA（Virtual Memory Area）
- 好處:高速檔案存取、執行緒共享記憶體
pthread
- 預設 thread num = 4
- 每個 thread 工作量一樣 (append data = 87475)

至於 findName() 時間也變快的原因下面 show_entry 後就可以看出來
針對特定測資就會出現一些誤差預設搜尋的zyxel
在 orig 版本所在位置在 349891 行
而用 opt 版本由於順序是會被打亂的此次出現在 262423 行所以搜尋的時間花費較少但速度其實是一樣的

正確性

append:

利用 phonebook_opt.c定義的show_entry來檢視用 pthread 建立的 linked list 是否與來源(words.txt)一致

可以發現

順序是被打亂的

words.txt 有 349900行 show_entry只有349896行少四行
發現原本程式中有點小錯誤再串接各 thread 建立的 list 時

pHead = thread_args[0]->lEntry_head->pNext;
...
for (int i = 1; i < THREAD_NUM; i++) {
        e->pNext = thread_args[i]->lEntry_head->pNext;
        ...

應改為

pHead = thread_args[0]->lEntry_head;
...
for (int i = 1; i < THREAD_NUM; i++) {
        e->pNext = thread_args[i]->lEntry_head;
        ...

行數就正確了

findName:

opt 版的 findName 有點小疑惑建好 list 後 show_entry 為 349900 行沒錯
但不知道為何 findName 完後再執行 show_entry 就變成 349899 少了一行
回頭看完學長影片才知道這是故意設計成這樣的
但我下面要實作 delete 所以設計回 findName 完後可以 show_entry 出完整 list 的版本

while (pHead) {
        if (strncasecmp(lastname, pHead->lastName, len) == 0
                && (pHead->lastName[len] == '\n' ||
                    pHead->lastName[len] == '\0')) {
            // pHead->lastName[len] = '\0';

再優化

merge linked list improve^[3]

use macro at thread set

for (int i = 0; i < THREAD_NUM; i++)
    thread_args[i] = createThread_arg(map + MAX_LAST_NAME_SIZE * i, map + file_size, i,THREAD_NUM, entry_pool + i);
for (int i = 0; i < THREAD_NUM; i++)
    pthread_create(&threads[i], NULL, (void *)&append, (void *)thread_args[i]);

改寫成

for (int i = 0; i < THREAD_NUM; i++)
        THREAD_SET(i);

time(100 run):append()0.004217 findName() 0.005007
append 時間減少一個設定好就先跑的概念

實作 Delete Name

orig 時間:

execution time of append() : 0.052377 sec
execution time of delete() : 0.005598 sec
execution time of findName() : 0.005562 sec

opt 時間:

execution time of append() : 0.004075 sec
execution time of delete() : 0.003510 sec
execution time of findName() : 0.003502 sec

在建完 list 後可以透過指令DELETE(name)來刪除特定的資料
delete 的速度應該跟 findName 是差不多的

Refactoring

改用之前修改的時間工具 Stopwatch 測量時間
參考老師的API包裝法包裝了原本的 Stopwatch
將 orig & opt 版本包裝成 API ， phonebook_orig.h & phonebook_opt.h 合併成 phonebook.h 實現多型(polymorphism)

重構之後程式碼好閱讀多了依照階段來完成任務並會自動對應到不同版本的實作成就感UPUP~
~~然後剩不到一天還剩兩個作業…~~