Homework 2 / phonebook-concurrent

tags: `jserv`

Date: 2016/10/23

參考網址

Github：https://github.com/hankGo/phonebook-concurrent

課程網頁

Reference

homework2 回顧
contributed by <ruby0109>
contribute by <bobsonlin> for recognize code
contribute by <andy19950> for refactor

To do list

驗證正確性

看完作業提示和前人的筆記之後，我們需要知道當初優化的版本是否已經正確將 words.txt 完整的放到 entry 裡面。

目前已知的驗證方法有兩種，

一是原始 code 提供了 show_entry 方法，可以將已經存入 entry 所有的 lastName 直接列出，可以使用 linux 的指令將 printf() 的輸出存到一文字檔，再利用 linux 提供的 diff 指令和原始的 words.txt 做比對，找出兩者檔案不同的地方。

這個方法的檢查速度會比較快一點，但是我不太會用 diff 方法，暫時要先用比較土法煉鋼的方法去查詢，但是慢很多。 Hank Chen

我個人的作法是模仿原始 code 裡面的 findname() 方法，原始的 findname() 是直接輸入指定單一 lastName 去做搜尋，而我是搭配 fgets()和 while loop 去讓他自動讀入 words.txt 的每一行原始資料，再利用每一行資料去 findname。但是這個的所需要的搜尋次數就是原始資料筆數 * entry 筆數 (其實也約略等於原始資料筆數)，時間複雜度約略為 O(n^2) 跑過一次真的很久，不過目前我只會這種方法。執行結果約略如下：

$ ./phonebook_opt

size of entry : 24 bytes
execution time of append() : 0.010886 sec
execution time of findName() : 0.004769 sec

aaaa
 is no found 
aaaaa
 is no found 
aaaaaa
 is no found 
aaaaaaa
 is no found

雖然其他人的筆記說這邊有問題，但是我看不懂，所以是先把 pthread 那邊看完才回來改這邊

可以發現，有四筆漏掉了，沒有成功輸入到 entry。因為是 opt 的版本有誤，所以要來看 opt 版本 append 的地方 ()，pHead = app[i]->pHead->pNext; 應改為 app[i]->pHead;，因為儲存 lastname 的資料串的頭 pHead 的下一個應該要接各個 pthread 所建立的資料串，應為 app[i]->pHead;，若為原始 code 的 app[i]->pHead->pNext;，則會靶子資料串的第一個跳掉，變成第二個開始，所以才會有上面部分資料被跳掉的狀況，修改完 code 如下：













    entry *etmp;
    pHead = pHead->pNext;
    
    for (int i = 0; i < THREAD_NUM; i++) {
        if (i == 0) {
            pHead = app[i]->pHead; // pHead 串第一個子資料串的頭
        } else {
            etmp->pNext = app[i]->pHead; // 第 n 個子資料串的尾，接第 n+1 資料串的頭
        }

        etmp = app[i]->pLast;

    }

基本流程

開啟檔案

orig 開啟檔案

使用fopen()測試檔案能不能打開
利用while和fgets()將檔案一行一行讀入 REF

opt 開啟檔案

file_align 定義在 file.c，主要為對齊檔案使用，主要是為了後面的 mmap 來鋪路，流程大概如下：
- writebuff 先使用 memset()塞滿\0，長度為一固定長度pad
- 確認讀進來新的一行，readbuff 的長度不超過 pad 的情況下，使用 strcpy() 來複製字串

下面這段不太懂，需要做測試 Hank Chen

    while (fgets(rbuf, sizeof(rbuf), fd0)) {
        memset(wbuf, '\0', pad);

        if ((suffix = (pad - strlen(rbuf))) != 0)
            strcpy(wbuf, rbuf);

        fwrite(wbuf, pad, 1, fd1);
    }

Q：哪邊有 align 的效果 = =？

對齊過後的檔案，用內建的文字編輯器看不出來，需要用 vim 才看的到所有的文字全部靠右，左邊剩餘的字元空間全部被 ^@ 塞滿，根據 ruby0109 同學的筆記，才知道這個是 NULL 符號，真的有對齊效果的感覺了，雖然下圖看起來沒有對齊的樣子，不過拿第一排為例，有 11 個 ^@ 和 5 個 a 加起來剛好是16個字元，其餘資料以此類推。

aaaa
^@^@^@^@^@^@^@^@^@^@^@aaaaa
^@^@^@^@^@^@^@^@^@^@aaaaaa
^@^@^@^@^@^@^@^@^@aaaaaaa
^@^@^@^@^@^@^@^@aaaaaaaa
^@^@^@^@^@^@^@aaaaaaaah
^@^@^@^@^@^@aaaaaaauugh
^@^@^@^@aaaaaagh
^@^@^@^@^@^@^@aaaaaahhhhh
^@^@^@^@aaaaaaugh
^@^@^@^@^@^@aaaaagh
^@^@^@^@^@^@^@^@aaaaah
^@^@^@^@^@^@^@^@^@aaaarthur
^@^@^@^@^@^@aaaaw
^@^@^@^@^@^@^@^@^@^@aaagghhhh
^@^@^@^@^@^@aaah
^@^@^@^@^@^@^@^@^@^@^@aaaugh
^@^@^@^@^@^@^@^@^@aaccf
^@^@^@^@^@^@^@^@^@^@aachen
^@^@^@^@^@^@^@^@^@aacr
^@^@^@^@^@^@^@^@^@^@^@aadland
^@^@^@^@^@^@^@^@aaem
^@^@^@^@^@^@^@^@^@^@^@aagate

open：取得一檔案指標，與 fopen 不一樣的是，回傳的值不是 FILE* 型態的，而是一個 int 型態。

Q：為什麼不用 fopen 而是用 open？

fopen() vs open()：可能是為了宣告 non-blocking mode 才棄fopen()
- fd = open(ALIGN_FILE, O_RDONLY | O_NONBLOCK);
- open()+O_RDONLY 同 fopen() r 模式 source1 source2
- O_NONBLOCK 宣告為 Non-blocking IO mode source1 source2 source3

可能是因為要用到後面的 mmap() 所以才用 open() 而不用 fopen()?
從 ruby0109 看到的 stackoverflow
hankgo

fsize()：用來算 file (align.txt) 的大小，也是為了 mmap 鋪路
這個一樣需要先看子函式的怎麼運作的，發現需要 man 一下 stat，這個函式 return total size in bytes linux man page

建立 linked list 資料串的頭 (entry)

先以 entry * 建立 pHead 資料串的頭之後，用 malloc 給他一塊空間，往後 findname 的搜尋起始就從這裡開始。

append 原始版本

fgets 的讀取特性是，當遇到換行字元 \n 或者已讀入 fgets() 所指定的字串大小，就會停止讀取該行。讀入的字串結尾就會有 \0，另外若是因為換行符號而中止的話，讀入的該字串最後兩個空間則分別為 \n \0，所以在後面做 append 的時候，原始的程式碼還有做將 \n 替換成 \0 的動作。
將字串的 \n 移除之後，將處理好的字串利用 append 放入 lastname 資料串裡面。

append (opt，使用 pthread)

這塊是這個最佳化版本的 phonebook 的重點所在，為了減少大量的 I/O 存取，所以使用了 mmap，再利用 pthread 去做多工 (這個詞語好像不太精確)，增加 append 的效能。最後再將各個子資料串串連起來，完成 append 的工作。

mmap

source
Linux 提供了記憶體映射函數 mmap，它把文件內容映射到一段記憶體上(準確說是虛擬記憶體上)，透過對這段記憶體的讀取和修改，實現對文件的讀取和修改。Linux 允許將檔案對映到記憶體空間中，如此可以產生一個在檔案資料及記憶體資料一對一的對映，例如字型檔的存取。

使用記憶體對映有許多好處：

高速檔案存取。一般的I/O機制通常需要將資料先到緩區中。記憶體對映免去了中間這一層，加速檔案存取速度。
可執行檔可對映到記憶體空間中，使程式動態載入。Linux Dynamic Loading便是如此實作出來的。
新的記憶體可以透過利用/dev/zero來產生全零的檔案。
新的記憶體可以用於執行目的，這對解譯式編譯器非常有用。
可把檔案當成記憶體來用，直接使用指標來操作。
對映的記憶體可當成 process 間共享記憶體，該記憶體內容存在檔案中，因此與 process 無關。

USAGE

void *mmap(void *addr, size_t length, int prot, int flags, 
           int fd, off_t offset);

Linux man page for mmap()

addr：記憶體映射區塊的起始位置，若設為NULL，代表讓系統自動選則合適的位址，然後回傳起始位置。
- 作業的 code 給 NULL，讓系統去指定起始位置。
length：映射區塊的大小。
- 作業的 code 給 fs，就是剛剛利用 fsize 算出來的 align.txt 的檔案大小，要 map 的空間就是對應整個檔案大小。
prot：映射區域的保護方式。注意要避免跟 open() 的 mode 衝突
- PROT_EXEC Pages may be executed. 映射區域可被執行
- PROT_READ Pages may be read. 映射區域可被讀取
- PROT_WRITE Pages may be written. 映射區域可被寫入
- PROT_NONE Pages may not be accessed. 映射區域不能存取
- 作業的 code 是給 PROT_READ，純讀取模式。跟先前 open 給定的 O_RDONLY 沒有衝突，符合邏輯。
flags：假如多個 process 都映射到這個 map file，process A 做的更新，是否可以被 process B, C, D… 看見？
- MAP_SHARED：其他 process BCD 可以看的到 process A 做的更新，對應之 map file 檔案也會一併更新
- MAP_PRIVATE：建立一個私人的 copy-on-write mapping。其他 process BCD 看不到 process A 做的更新，對應之 map file 亦不會更新。
- 其他還有多種特殊變形，但是基本的就上面這兩種
- 作業的 code 是給定 MAP_SHARED，個人認為因為是要多個 pthread 去一起完成 append 的任務，所以應該是要用 SHARE 的這個模式。
fd：用系統函式 open 打開的檔案位置，要注意 open() mode 不能跟前面 prot 衝突。在這邊就可以稍微理解為何不用 fopen() 了。還是 fopen() 也能用？不過型態不同。
- 作業的 code 是使用 open 所回傳的 int 型態的檔案位置。
offset：從文件映射開始處的偏移量，通常為0，代表從文件最前方開始映射。offset必須是分頁大小的整數倍，可以用 sysconf(_SC_PAGE_SIZE) 取得。(在32位體系統結構上通常是4K)
- 因為要讀取整個檔案，所以 offset 應為 0。
mmap() 回傳映射區域的起始位置。
- 因為 lastname 資料都直接 map 過來了，相信後續的檔案操作就是直接從 memory 要資料，所以這個回傳的位址空間會用到。

img source2

img source

main.c 的 mmap (再講一次)

char *map = mmap(NULL, fs, PROT_READ, MAP_SHARED, fd, 0);

指定起始位置給 NULL，讓系統直接設定
檔案大小 fs 由之前的 fsize( ALIGN_FILE) 取得，得到 align_file 的檔案大小，看樣子整個 txt 都要做 mapping，offset 應該就會為 0
prot 這邊是設定 PROT_READ，前面的 open() mode 則是設定為 O_RDONLY 兩個都是讀取模式，不衝突。
flags 設定為 MAP_SHARED，應該是因為要分為多個 thread，所處理的結果應該要可以被其他的 process 可以看到。
fd：由前面 open() 所回傳的檔案指標。
offset 設定為 0，因為整個檔案都要讀進去。
最後 *map 就會收到 mmap 所回傳的映射區塊起始位置，這樣在 append 的時候直接去跟某位址的記憶體空間去要資料。

entry_pool

"pool" 有池子的意思，在這邊的用法應該就是把所有讀入的資料都丟到這個池子裏面。而要做 findname 的時候再到這個 entry_pool 找資料。

entry *entry_pool = (entry *) malloc(sizeof(entry) * fs / MAX_LAST_NAME_SIZE);

可以看到上面這段 code，在宣告 *entry_pool 的時候，同時配置了一塊空間，大小 sizeof(entry) * fs / MAX_LAST_NAME_SIZE -> entry 空間大小 * 資料筆數。

pthread

Getting Started With POSIX Threads

事前準備

pthread_setconcurrency：設定 level。

為什麼這邊的 level 要設定成 THREAD_NUM + 1，而不是 THREAD_NUM 呢？怕變成負的？ Hank Chen

建立以 pthread_t 型態的 pthread，這邊是用 malloc 建立的，所以應該可以使用類矩陣的 index
- 突然對 pthread_t 的資料型態有興趣，所以先土法煉鋼找了 /usr/include 的 pthread.h，裏面沒寫，再去看 pthreadtypes.h，看到以下定義：

typedef unsigned long int pthread_t;

append_a 型態 (why double star?)
從 phonebook_opt.h 先觀察 append_a 的資料型態

typedef struct _append_a {
    char *ptr;          // first address of certain pthread on map; map + MAX_LAST_NAME_SIZE * i
    char *eptr;         // last address of map; map + fs
    int tid;            // thread id ( 0, 1, 2, ...)
    int nthread;        // number of threads
    entry *entryStart;  // start address in entry pool
    entry *pHead;
    entry *pLast;
} append_a;

set_append_a 設定 append_a
裡面原來名字叫做 new_append_a 的樣子，但是觀察過他其實是用來設定上述 append_a 資料型態裡面的成員，所以改名字。除了傳入的 argument 都會設定之外，目前 pHead pLast entryStart都是設定成傳入的 entry *start

append_a *set_append_a(char *ptr, char *eptr, int tid, int ntd, entry *start);

不過看他使用的樣子，大概就可以知道為什麼要用 double star 了

    for (int i = 0; i < THREAD_NUM; i++)
        app[i] = set_append_a(map + MAX_LAST_NAME_SIZE * i, map + fs, i, THREAD_NUM, entry_pool + i);

假設我有四個 pthread，所以應該會對應到 app[0] ~ app[3]。觀察後面的 code，其實可以發現，app[0] ~ app[3] 是這些子資料串的「頭」，後面用 pthread 跑的 append 會拿新的資料皆在這些頭後面，一個一個串起來。所以這些 app[x] ptr 是指向 ptr to append_a，所以應該要用 double star。

跑 pthread

`pthread_create()`

建立一個新的 pthread；starts a new thread in the calling process

man page for pthread_create

用法

int pthread_create( &a_thread, a_thread_attribute, (void *) &thread_func, (void *) &argument);
//return 0 when success

在本篇 code 的用法：不指定 pthread_attr 設定 pthread 的屬性，一個 pthread 執行函式 append 的任務，而傳入的參數 app[i] 則是

    for (int i = 0; i < THREAD_NUM; i++)
        pthread_create( &tid[i], NULL, (void *) &append, (void *) app[i]);

`append()`




























void append(void *arg)
{
    struct timespec start, end;
    double cpu_time;

    clock_gettime( CLOCK_REALTIME, &start);

    append_a *app = (append_a *) arg;

    int count = 0;
    entry *j = app->entryStart;
    for (char *i = app->ptr; i < app->eptr;
            i += MAX_LAST_NAME_SIZE * app->nthread,
            j += app->nthread,count++) {
        
        //set the next linked pointer
        app->pLast->pNext = j;
        
        //set "the next pointer" data
        app->pLast = app->pLast->pNext;
        app->pLast->lastName = i;   // i (pointer in map) points to lastname  
        app->pLast->pNext = NULL;
    }
    clock_gettime(CLOCK_REALTIME, &end);
    cpu_time = diff_in_second(start, end);

    pthread_exit(NULL);
}

每個 app[x] 都有每個資料串的專屬資訊。如 thread_id total # of threads 等等，append 這邊比較需要用到的就是 entry 的資訊。起初 entryStart pHead pNext 都是等於 entryStart，也就是 entry_pool 加上特定 offset 的位置。資料也是從這邊開始長
for 迴圈會用到 i跟j 這兩個指針。i 對應的是 map 的位置，每一次讀取完畢之後會跳到 total # of threads 個名字，而 j 是對應到 entry_pool 後的某一個位置。
資料會從 pNext 開始長，每次將 i 位置的字串位置設定到 j entry 裡面的 lastname 位置，這樣 lastname 就設定完成了，不需要再用到 fgets() 也因為他是利用 mmap 的方法，也可以減少檔案 IO 而增加讀取效率。
設定一筆資料完成之後，再到 pLast->pNext，也就是新的 j 位置再設定新的 entry 資料。
最後這個 append 完成之後，呼叫 pthread_exit(NULL); 結束該 pthread

`pthread_join(pthread_t thread, void **retval);`

避免該 pthread 還沒開始執行就結束，pthread_join() 函式會等該 thread 執行完之後才會回傳。
man page for pthread_join

用法：

pthread_join( some_thread, &exit_status );

在本篇 code 的用法：

    for (int i = 0; i < THREAD_NUM; i++)
        //wait until pthread is done
        pthread_join(tid[i], NULL);

子資料串串接








    for (int i = 0; i < THREAD_NUM; i++) {
        if (i == 0) {
            pHead = app[i]->pHead;
        } else {
            etmp->pNext = app[i]->pHead;
        }
        etmp = app[i]->pLast;
    }

在驗證資料正確性的時候，有四個 lastname 是沒有被 append 進去的，是因為在 app[i]->pLast->pNext = app[i+1]->pHead 沒有確實串好的關係。上一個資料串的尾，連接下一個資料串的頭，這樣就可以把分工 append 好的子資料串整個接成一個完整的資料串了。

總結

圖一

圖二

建立 pthread_t 跟 append_a 來設計如何多工。pthread_t 的數量要拆成幾個 thread 完成 append 的工作。而 append_a 的設計影響到每個 thread 如何 append 資料。
使用 setconcurrency pthread_create pthread_join 來讓 pthread 運作。
運作方式寫在 append 裡面，搭配 append_a，其方式是以 entryStart 為開始，同時也是各子資料串的頭 pHead，開始一個一個將 entry 串接起來。pLast 隨著串接的資料多寡逐次往後延伸。也因為事先已經將 lastname 做 alignment，所以在讀取 lastname 資料時，直接將 entry 裡面的 lastname 指向 map 的位置即可，非常的方便迅速。
上述的位置的跳動由 for 迴圈的 i 和 j 去做控制。要注意 i 的型態為 char，offset 需乘上 MAX_LAST_NAME_SIZE。而 j 的型態為 entry，跳躍的 offset 直接給定正整數即可
如圖二所示，pNext 和 pLast即是下一個 pLast 位置。子資料串直接加長
回到圖一，最終會將各子資料串串接起來，藉由上一個資料串的 pLast->pNext = 本資料串的 pHead。

Refactor & 部份修正

我發現 #ifndef OPT 和 #if defined(OPT) 太散了。

例如一開始的讀檔案，未最佳化版本讀檔，不懂為什麼前面宣告完 filestream 之後，中間再塞一個適用於最佳化版本的 struct timespec mid 的宣告，再回來讀檔，有點讓人難以接受。
#ifndef #if defined #else #endif 好像無法用我們習慣的縮排去區分階層區塊，有一點難閱讀。

釋放記憶體空間

main.c
- 發現 int fd = open(ALIGN_FILE, O_RDONLY | O_NONBLOCK); 之後沒有 close(fd)
file.c
- 發現 char *wbuf = (char *) malloc(sizeof(char) * pad); 之後 wbuf 也沒有 free()

效能分析實驗 (陽春版)

我一直懷疑 pthread 真的有作用嗎？所以一開始簡單的在 append printf 目前的 thread number，真的印出來每次都不一樣，例如下面這個，thread 結束的順序都不太一樣，暫時我是信了。

size of entry : 24 bytes
thread # : 4
thread # : 1
thread # : 2
thread # : 3
thread # : 0
thread # : 6
thread # : 5
thread # : 7
execution time of append() : 0.006731 sec
execution time of findName() : 0.005221 sec![](https://i.imgur.com/2VuhaRX.png)

Thread 數目對效率的影響

thread num = 1

thread num = 2

thread num = 4

thread num = 8

重新學 gnuplot 怎麼使用，使用下面的 script 畫圖。參考網址

reset
set ylabel 'time(sec)'

set style data histogram
set style histogram clustered
set style fill solid

set title 'perfomance comparison'
set term png enhanced font 'Verdana,10'
set output 'runtime_compare.png'

set format x "%g"

plot for [COL=2:3] 'compare.txt' using COL:xticlabels(1) title columnheader

得到以下比較圖，發現 thread_num = 4 的時候 append() 效果稍好。

這個作業的重點好像是在 append() time，所以單獨畫出來

thread-pool

ref: tundergod 的 hackmd 筆記

材料閱讀

An Introduction to Lock-Free Programming

很像的翻譯文章，我看不懂原文 google 找到的，可以先看一下之後再回來看原文。
使用 Lock 固然方便，但是 lock/unlock 需要一些成本，且用了之後也有可能產生意想不到的外部成果例如 deadlock

是否為 Lock-free program ?

沒有用 lock 不一定是 lock-free program




while (X == 0)
  {
      X = 1 - X;
  }

Lock-free program 實現方法
- Atomic Operations：atomic operation 不能「做一半」，只能一氣呵成的完成，如原子般不可切割，故稱之。
  - RMW (read-modify-write)
  - CAS (compare-and-swap)
Sequential Consistency：all threads agree on the order in which memory operations occurred, and that order is consistent with the order of operations in the program source code
- Memory Ordering (prevent memory reordering)：Memory Barrier、memory fence instruction、lightweight sync or fence instruction
Avoiding ABA Problem：目前我是以「資料被調包」做理解 source

The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software

CPU 的進展已經從早期的增加 Clock rate 變成 Hyperthreading & Multicore 方向邁進。
The Free Performance Lunch
- 過去增加效能的三個方法
  - clock speed (單位時間增加更多 cycles)
  - execution optimization (單位 cycle 可以做更多的工作) ex.pipeline, branch prediction…
  - cache (減少存取 main memory 的機會)
- 用上述方法，遇到了物理瓶頸：notably heat (too much of it and too hard to dissipate), power consumption (too high), and current leakage problems.
TANSTAAFL: Moore’s Law and the Next Generation(s)
- 新三寶：
  - hyperthreading：若 program/application 有支援，可以讓單一 cpu 執行 2/more threads，最高可以提升 5~40% 的效能，若程式只有 single-thread，hyperthreading cpu 也無用武之地。
  - multicore
  - cache：RAM 還是比 CACHE 慢很多
What This Means For Software: The Next Revolution
- OOP’s strengths in abstraction and dependency management made it a necessity for achieving large-scale software development that is economical, reliable, and repeatable. 物件導向程式的抽象與管理概念在開發大型應用程式的時候好用而開始受到重視成為主流。
- Concurrency is the next major revolution in how we write software. 下一個革命
Benefits and Costs of Concurrency
- Benefits
  - to logically separate naturally independent control flows
  - for performance, either to scalably take advantage of multiple physical CPUs or to easily take advantage of latency in other parts of the application
- Costs：雖然有成本，但若能適當使用，得到的效益會本成本高許多
  - Locks：lock-based programming is hazardous.
  - concurrency is that not all applications are amenable to parallelization. => the model in the programmer’s head that he needs to reason reliably about his program, is much harder than it is for sequential control flow.
What It Means For Us 對我們的啟發
- 為了榨出CPU更多的的效能，Concurrency 是必須的，但是不是每個 application 都適合平行化。
- CPU-bound：指的是佔用龐大的CPU資源的程式，即使 cpu 性能越來越好，而有些程式越來越能把 cpu 的資源耗盡。
- programming languages and systems 被迫開始支援 concurrency 的支援，如 ptheread 和 OpenMP
Conclusion
- Cache 的增長可以改善對一些直線控制的流程，但 throughput 的增加仍要持續利用 concurrency 來最佳化。