2017q1 Homework4 (phonebook-concurrent)

contributed by <xdennisx>

硬體規格

dennis@dennis-X550CC:~$ lscpu
Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              58
Model name:            Intel(R) Core(TM) i5-3337U CPU @ 1.80GHz
製程：              9
CPU MHz：             1270.898
CPU max MHz:           2700.0000
CPU min MHz:           800.0000
BogoMIPS:              3592.04
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K
NUMA node0 CPU(s):     0-3

分析實作

text_align
```
    text_align(DICT_FILE, ALIGN_FILE, MAX_LAST_NAME_SIZE);
    int fd = open(ALIGN_FILE, O_RDONLY | O_NONBLOCK);
    off_t file_size = fsize(ALIGN_FILE);
```
- text_align 將原本 DICT_FILE 的每一行名字都對齊 MAX_LAST_NAME_SIZE 的長度，長度太短就在後面補滿 \0，最後輸出成 ALIGN_FILE，這樣之後讀檔每一行的大小就都是一樣的，目的要讓記憶體對齊可加速讀取。
- open 為開啟檔案函式，第二個參數為開啟檔案的方式，若開啟成功則會回傳 0。
- fsize 會利用 stat() 回傳以 byte 為單位的檔案大小。

mmap

利用處理過的 ALIGN_FILE 直接對映到記憶體上，mmap 會回傳資料起始位址


  map = mmap(NULL, file_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);

順便利用計算過的 file_size，一次就把所有的 entry 給 malloc 好


  entry_pool = (entry *)malloc(sizeof(entry) *
                               file_size / MAX_LAST_NAME_SIZE);

pthread

pthread_setconcurrency(THREAD_NUM + 1)
用來提示系統，表明希望的 thread 數量，這邊 +1 應該是希望能排除 main-thread。
參考資料說到，如果 CPU 同時一被分配給一個以上的 thread 的話，可能就會改善性能。
但是這裡說到LinuxThreads 跟 NPTL 都是 1：1，用pthread_setconcurrency 好像沒啥用處@@

應該使用繁體中文/台灣地區的慣用科技術語，請修改。不要輕易說「好像」，你做過實驗了嗎？ –jserv

好的老師，我會改進，發揮實驗精神 –xdennisx

根據我的硬體規格是 2 threads per core, 2 cores per socket.
嘗試用不同的 THREAD_NUM 去看結果
n=3

execution time of append() : 0.004465 sec
execution time of findName() : 0.001146 sec

n=4

execution time of append() : 0.008299 sec
execution time of findName() : 0.003174 sec

n=5

execution time of append() : 0.004265 sec
execution time of findName() : 0.001105 sec

n=6

execution time of append() : 0.006449 sec
execution time of findName() : 0.001518 sec

n=7

execution time of append() : 0.006740 sec
execution time of findName() : 0.004059 sec

n=8

execution time of append() : 0.007461 sec
execution time of findName() : 0.004209 sec

原本想用 pthread_getconcurrency() 看看是不是設定超過最大數量的 thread，pthread_getconcurrency() 會回傳一樣的值，不過結果我設多少他就回傳多少，而且 pthread_setconcurrency() 設多少效能上也沒有什麼太大的改變，從此判定 pthread_setcomcurrency() 在 Linux 上是沒什麼太大用處

我的 kernel model：


dennis@dennis-X550CC:~/phonebook-concurrent$ getconf GNU_LIBPTHREAD_VERSION
NPTL 2.23

分配每個 thread 下去工作




for (int i = 0; i < THREAD_NUM; i++)
    // Created by malloc, remeber to free them.
    thread_args[i] = createThread_arg(map + MAX_LAST_NAME_SIZE * i, map + file_size, i,
                                      THREAD_NUM, entry_pool + i);

最後再把他串起來








for (int i = 0; i < THREAD_NUM; i++) {
    if (i == 0) {
        pHead = thread_args[i]->lEntry_head->pNext;
    } else {
        e->pNext = thread_args[i]->lEntry_head->pNext;
    }
    e = thread_args[i]->lEntry_tail;
}

原始輸出圖

改善方法

Thread pool

ThreadPool 的基本原理是將要執行的工作放進一個 Queue 裡，然後啟動多條 Thread，每條 Thread 會去檢查 Queue 裡還有沒有工作要做，若有就取出來。
由於要管控每件工作只交給其中一個 Thread 去處理，所以得動用 lock機制；也就是同一時間只能開放一條 Thread 去檢查 Queue 並取出工作，可以想像成全部的 Thread 排成一列輪流存取。
在程式裡透過 pool->count 掌握工作何時全部做完，在遞減 pool->count 也要等同要啟動 lock 機制，任何時間下只允許一個 Thread 去修改它的值。
概念圖
- threadpool_add function 就是把工作塞進去 Queue 裡面，*threadpool_thread function 就是把 Queue 裡面的工作分配給 Thread
- Queue 裡面每個元素
參考 0140454 跟 ierosodin 實作 threadpool 結果…

Hmmm…哪邊出問題R 0.0

不過 cache-miss 有降

phonebook_opt

Performance counter stats for './phonebook_opt' (100 runs):

         731,497      cache-misses              #   40.420 % of all cache refs      ( +-  0.93% )
       1,809,747      cache-references                                              ( +-  1.81% )
     246,807,242      instructions              #    1.20  insns per cycle          ( +-  0.12% )
     205,503,773      cycles                                                        ( +-  1.06% )

     0.083420525 seconds time elapsed                                          ( +-  1.80% )

phonebook_tp

Performance counter stats for './phonebook_tp' (100 runs):

         789,868      cache-misses              #   36.357 % of all cache refs      ( +-  0.60% )
       2,172,511      cache-references                                              ( +-  1.40% )
     259,023,470      instructions              #    1.16  insns per cycle          ( +-  0.05% )
     223,875,263      cycles                                                        ( +-  0.71% )

     0.079537321 seconds time elapsed                                          ( +-  0.79% )

Reference

ierosodin
C 的 Thread Pool 筆記
 Threadpool 比較