2017q1 Homework4 (phonebook-concurrent)

contributed by < stanleytazi >

tags: `stanleytazi` `hw4` `phonebook`

作業目標

學習第 4 週提及的 concurrency 程式設計
學習 POSIX Thread
建立 thread pool 來管理 worker thread

硬體資訊

Architecture:          x86_64
CPU 作業模式：    32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                4
On-line CPU(s) list:   0-3
每核心執行緒數：2
每通訊端核心數：2
Socket(s):             1
NUMA 節點：         1
供應商識別號：  GenuineIntel
CPU 家族：          6
型號：              69
Model name:            Intel(R) Core(TM) i5-4260U CPU @ 1.40GHz
製程：              1
CPU MHz：             926.484
CPU max MHz:           2700.0000
CPU min MHz:           800.0000
BogoMIPS:              4000.06
虛擬：              VT-x
L1d 快取：          32K
L1i 快取：          32K
L2 快取：           256K
L3 快取：           3072K
NUMA node0 CPU(s):     0-3

上課前老師列出可以探討的其他同學的共筆，發現 csielee 想要探討的問題跟我的滿類似，甚至更深入，可以參考，包括 pthread_setconcurrency() 的作用、把實作跟界面切割，這樣可以讓不同的實作更容易做替換、以直接做 mmap() 取代 file alignment + mmap()

想到的問題

程式裡有用一個 MAX_LAST_NAME_SIZE 來做 file alignment ，並且也利用這個來做 file 的切割，多的部份就會跳出警告，但是如果名字真的很長，例如現役NBA球星 Giannis Antetokounmpo 或是
mlb 球員 Jarrod Saltalamacchia(結果我知道的這兩個都沒超過程式設定的16 xd)，不過所要表達的是如果有超過的是不是要額外處理？比如說用兩個單位來儲存
目前儲存資料的方式是用 singly linked list ，如果之後想要換成別種儲存方式，那把界面與實作切割開來之後想換會比較容易
append()計算時間的起點不公平，應該要連做 file align 的時間都要算進去來跟原始程式比較，所以針對每一個 text file 裡的 word都會需要做 alignment & append，所以想到可以做實驗
- 把 file alignment 動作也包括進去算時間
- 讓 file aligment 也可以做 multi-thread
- 做 file aligment 也可以做 mmap
看起來目前程式對照地鼠燒書，比較像是在把在 append 階段 parallel execution，如果照前一個項目討論有 aligment & append 那在固定 thread 個數下去比較(假設4個)，是4個一樣是作 alignment + append 或是 aligment(2) + append(2)
最後是另外一個問題是自己未完成的部份，如果沒有要搜尋的人沒有 lastName該如何處理?

concurrent

pthread_setconcurrency

what is pthread_setconcurrency?
程式裡有用到這個，想理解這個的用處，找到
Programming with POSIX Threads 這本書的 google 分享片段
我的理解為這個是一個給 kernel的 hint，可以根據程式的實際狀況來設定有多少個 kernel thread 的需求，書中有一些例子 (從例子看起來是 user&kernel thread M-to-N 的關係 )，
(情境一) 如果有15個 user threads，其中有10個被 block 住，但是你希望另外 5 個可以繼續跑，那你就必須要給個提示程式需要15個 kernel threads
(情境二) 那如果你只希望5個只要有1個是可以繼續跑就好了，那你就可以設定11個 kernel threads
(情境三) 如果你知道程式不是很常會有 block 住全部threads，然後就算全被 block 住一下也還可以接受，甚至你知道超過6個 threads block 情況非常非常少，那你可以設定提示需要7個 kernel threads
但是書中有提到這個東西在一些系統是遵守 UNIX98 下可以自由的使用，很多系統只是把這個設定值當作是 pthread_getconcurrency() 的回傳值，然後不做任何事情，例如在 Digital UNIX ，kernel mode 以及 user mode 的 scheduler 會確保已經在 ready queue 的 thread 不會因為 block 住的thread 然後也不能跑
所以我們在 Linux 下使用這個到底有沒有用呢？
目前查到 UNIX98 是 Single UNIX Specification version 2 (SUSv2)的核心而從 SUSv3 開始才與 POSIX 合成同一份規格書，目前在 wikipedia 紀錄最新的版本為 IEEE Std 1003.1-2008, 2016 Edition and Single UNIX Specification, Version 4, 2016 Edition，在 THE LINUX PROGRAMMING INTERFACE 書中有提到 Linux 的總體目標是希望可以符合 POSIX 與 SUS 規範，但是在該本書撰寫時，並沒有任何一個 Linux 版本有授予 "UNIX" 商標，另外一個是如果我們可以使用 POSIX 那他是否可以向下相容？看到這突然想起之前找過 man

The pthread_setconcurrency() function informs the implementation of
       the application's desired concurrency level, specified in new_level.
       The implementation takes this only as a hint: POSIX.1 does not
       specify the level of concurrency that should be provided as a result
       of calling pthread_setconcurrency().
...
CONFORMING TO         top

       POSIX.1-2001, POSIX.1-2008.
       
Concurrency levels are meaningful only for M:N threading

       implementations, where at any moment a subset of a process's set of
       user-level threads may be bound to a smaller number of kernel-
       scheduling entities.  Setting the concurrency level allows the
       application to give the system a hint as to the number of kernel-
       scheduling entities that should be provided for efficient execution
       of the application.

       Both LinuxThreads and NPTL are 1:1 threading implementations, so
       setting the concurrency level has no meaning.  In other words, on
       Linux these functions merely exist for compatibility with other
       systems, and they have no effect on the execution of a program.

看到最後一段，原來 manual 寫得這寫著 Linux 上使用 pthread_setconcurrency() 與 pthread_getconcurrency() 只是為了相容，然後在程式中執行是不會有任何作用的。之後我們可以用實驗數據來看一下如果沒有這個 function 執行時間的差異

修正原始程式問題

看到 jack81306 同學有修正在 phonebook-concurrent 版本有 name 不見了，總共要有 349901
原因是在把 linked list 接起來的地方都多用了 pNext，導致 4個 thread 的 list head 都不見了


















 for (int i = 0; i < THREAD_NUM; i++) {
        if (i == 0) {
            pHead = thread_args[i]->lEntry_head;
   // orig: pHead = thread_args[i]->lEntry_head->pNext;
            DEBUG_LOG("Connect %d head string %s %p\n", i,
                      pHead->lastName, thread_args[i]->data_begin);
        } else {
            e->pNext = thread_args[i]->lEntry_head;
    //orig: e->pNext = thread_args[i]->lEntry_head->pNext;
            DEBUG_LOG("Connect %d head string %s %p\n", i,
                      e->pNext->lastName, thread_args[i]->data_begin);
        }

        e = thread_args[i]->lEntry_tail;
        DEBUG_LOG("Connect %d tail string %s %p\n", i,
                  e->lastName, thread_args[i]->data_begin);
        DEBUG_LOG("round %d\n", i);
    }

不過我自己在原始版本去算只有349900













    int count_orig = 0;
    while (fgets(line, sizeof(line), fp)) {
        while (line[i] != '\0')
            i++;
        line[i - 1] = '\0';
        i = 0;
        e = append(line, e);
        count_orig++;
    }

理解mmap()

CheHsuan同學共筆
看到這個想到在 OS 有提過這個function，可以把 I/O 映射到記憶體，接著就可以對記憶體做操作，以避免過多 read()/write()

待補

實驗

原始程式

拿掉 pthread_setconcurrency()

發現執行時間竟然有差別，而且是會超過 20% 的差異，這樣 man 所講是哪裡有認知錯誤呢？除非在 Linux 上 pthread 的使用是 M:N ?

把 file alignment 的時間加進來

可以看到時間會比較多，不過這也可以說明 input file 格式重要性，如果資料是有經過特殊排列，對於後續的處理是很有幫助的
之後實驗會以此為來試著改進

直接對原始 file 做 mmap() + 並約略均分 file 給不同 threads 做處理

一開始直接把需要處理的檔案 map 到 memory
並且試著去做均分 size_t sizeForThread = file_size/THREAD_NUM;
但是因為很可能分割在一個單字的中間，所以要找界線
(i+1)-th thread 的處理資料的開始位址會是 i-th thread 的處理資料的結束位址




















int fd = open(DICT_FILE, O_RDONLY | O_NONBLOCK);
    off_t file_size = fsize(DICT_FILE);
    map = mmap(NULL, file_size, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
    assert(map && "mmap error");
    data_end = map;
    size_t sizeForThread = file_size/THREAD_NUM;
    for (int i=0; i < THREAD_NUM; i++) {
        thread_args[i].data_start = data_end;
        if (i != (THREAD_NUM-1)) {
            data_end = map + (i + 1) * sizeForThread;
            while (*data_end != '\n')
                data_end++;
            thread_args[i].data_end = data_end;
            data_end++;
        }
        else
            thread_args[i].data_end = map + file_size;
        thread_args[i].entry_list_head = NULL; 
        thread_args[i].entry_list_tail = NULL; 
    }

append 主要就是每個 thread 會在自己處理資料範圍內找出每個單字
找到後就可以 alloc一塊記憶體來儲存
- 這邊是可以再優化的地方，可以先做 memory pool
接著也是各自串一個 singly linked list

























void append(void *arg)
{
    thread_arg *t_arg = (thread_arg *) arg;
    char *data = t_arg->data_start;
    int w = 0;
    int count = 0;
    entry *e = NULL;
    while (data < t_arg->data_end) {
        if (*(data+w) == '\n') {
            count++;
            e = (entry *)malloc(sizeof(entry));
            e->lastName = data;
            *(data+w) = '\0';
            data+=(w+1);
            w = 0;
            if (!t_arg->entry_list_tail)
                t_arg->entry_list_tail = e;
            e->pNext = t_arg->entry_list_head;
            t_arg->entry_list_head = e;
        }
        w++;
    }
    t_arg->count = count;
    pthread_exit(NULL);
}

執行結果
- THREAD_NUM=1
- THREAD_NUM=2
- THREAD_NUM=4
- THREAD_NUM=8
  
  以上可以看出來Multi-thread時 thread num 超過2之後執行時間會上升
perf record -e cycles ./phonebook_opt
- 花最多時間在比較'\n'
- 其次是malloc

  2.27 │3f:   mov    -0x20(%rbp),%eax                                                                                                                 ▒
  2.27 │      movslq %eax,%rdx                                                                                                                        ▒
  2.27 │      mov    -0x18(%rbp),%rax                                                                                                                 ▒
  2.27 │      add    %rdx,%rax                                                                                                                        ▒
  4.55 │      movzbl (%rax),%eax                                                                                                                      ▒
 25.00 │      cmp    $0xa,%al                                                                                                                         ▒
       │    ↓ jne    c9                                                                                                                               ▒
       │                count++;                                                                                                                      ▒
  4.55 │      addl   $0x1,-0x1c(%rbp)                                                                                                                 ▒
       │                e = (entry *)malloc(sizeof(entry));                                                                                           ▒
 11.36 │      mov    $0x18,%edi                                                                                                                       ▒
       │    → callq  malloc@plt                                                                                                                       ▒
  2.27 │      mov    %rax,-0x8(%rbp)                                                                                                                  ▒
       │                e->lastName = data;                                                                                                           ▒
       │      mov    -0x8(%rbp),%rax                                                                                                                  ▒
  6.82 │      mov    -0x18(%rbp),%rdx                                                                                                                 ▒
       │      mov    %rdx,(%rax)                                                                                                                      ▒
       │                *(data+w) = '\0';                                                                                                             ▒
  4.55 │      mov    -0x20(%rbp),%eax                                                                                                                 ▒
       │      movslq %eax,%rdx                                                                                                                        ▒
       │      mov    -0x18(%rbp),%rax