2016q3 Homework3 (mergesort-concurrent)

contributed by <kobeyu>

tags: `kobeyu`

Development environment @c9.io

Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                8
On-line CPU(s) list:   0-7
Thread(s) per core:    2
Core(s) per socket:    4
Socket(s):             1
NUMA node(s):          1
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Stepping:              4
CPU MHz:               2499.998
BogoMIPS:              4999.99
Hypervisor vendor:     KVM
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              30720K
NUMA node0 CPU(s):     0-7

Code analysis: mergesort-concurrent

Directory structure

.
├── AUTHORS
├── LICENSE
├── Makefile
├── README.md
├── list.c
├── list.h
├── main.c
├── scripts
│   ├── install-git-hooks
│   └── pre-commit.hook
├── threadpool.c
└── threadpool.h

1 directory, 11 files

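list.c: declares a linked-list data structure that holds the data being sorted, and provides the functions for accessing the list.
main.c: the program entry point, and where the merge sort itself is implemented.
threadpool.c: the task queue (tqueue_t) and the worker-thread pool (tpool_t), covered under threadpool.[ch].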

Makefile

gcc -std=gnu99 -Wall -g -pthread -o list.o -MMD -MF .list.o.d -c list.c
gcc -std=gnu99 -Wall -g -pthread -o threadpool.o -MMD -MF .threadpool.o.d -c threadpool.c
gcc -std=gnu99 -Wall -g -pthread -o main.o -MMD -MF .main.o.d -c main.c
gcc -std=gnu99 -Wall -g -pthread -o sort list.o threadpool.o main.o -rdynamic

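While reading the Makefile I ran into flags I had not seen before. After looking them up: they save each object's dependencies to a file, as described here.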

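-M
  Outputs the dependency information for a source file: a make rule listing every file the object depends on. It is easy to try with gcc -M hello.c.

-MM
  Like -M, but omits header files found in system header directories (typically the ones pulled in with #include <file>).

-MD
  Like -M, but the rule is written to a .d file and compilation continues instead of stopping after preprocessing.

-MMD
  Like -MD, but omits system headers in the same way as -MM.

-MF
  Names the file the dependency rule is written to. Without it, -MD and -MMD derive a .d file name from the output file name (or from the source file name when no -o is given).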

I wrote a file containing only #include <stdio.h> to try this out.
Result with -MD only:

$ gcc -MD hello.c
$ cat hello.d
 hello.o: hello.c
 /usr/include/stdc-predef.h /usr/include/stdio.h \
 /usr/include/features.h /usr/include/x86_64-linux-gnu/sys/cdefs.h \
 /usr/include/x86_64-linux-gnu/bits/wordsize.h \
 /usr/include/x86_64-linux-gnu/gnu/stubs.h \
 /usr/include/x86_64-linux-gnu/gnu/stubs-64.h \
 /usr/lib/gcc/x86_64-linux-gnu/4.8/include/stddef.h \
 /usr/include/x86_64-linux-gnu/bits/types.h \
 /usr/include/x86_64-linux-gnu/bits/typesizes.h /usr/include/libio.h \
 /usr/include/_G_config.h /usr/include/wchar.h \
 /usr/lib/gcc/x86_64-linux-gnu/4.8/include/stdarg.h \
 /usr/include/x86_64-linux-gnu/bits/stdio_lim.h \
 /usr/include/x86_64-linux-gnu/bits/sys_errlist.h

Result with -MMD only:

$ gcc -MMD hello.c
$ cat hello.d
 hello.o: hello.c

The .d files generated for this project:

$ cat .list.o.d 
list.o: list.c list.h
$ cat .main.o.d 
main.o: main.c threadpool.h list.h
$ cat .threadpool.o.d 
threadpool.o: threadpool.c threadpool.h

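From these results, my first guess was that the point is to confirm that main.o really covers threadpool.h and list.h, but intuitively it could not be that simple XD
It was indeed not that simple. I plan to split threadpool out into a small program, so I went back to the homework's Makefile for reference. The first thing I noticed was one odd-looking line: the key seems to be pulling the contents of the .d files back into the Makefile.
I found a Stack Overflow thread where this approach is discussed (link).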

Pay attention to how -include is used. jserv

The part I still did not understand was "$<".

$< is an automatic variable that stands for the first prerequisite (dependency) of the rule.
kevinbird61

OBJS = list.o threadpool.o main.o
...
deps := $(OBJS:%.o=.%.o.d)
%.o: %.c
	$(CC) $(CFLAGS) -o $@ -MMD -MF .$@.d -c $<

-rdynamic

Pass the flag -export-dynamic to the ELF linker, on targets that support it. This instructs the linker to add all symbols, not only used ones, to the dynamic symbol table. This option is needed for some uses of dlopen or to allow obtaining backtraces from within a program.

Even after looking this up I did not fully understand its purpose or function; it seems to be for debugging? (sigh)

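Hint: use it together with mutrace. You need some understanding of how dynamic linking works when a C program runs; see 你所不知道的C語言：動態連結器篇 (What You Don't Know About C: The Dynamic Linker). jserv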
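To make the effect of -rdynamic more concrete, here is a minimal sketch assuming glibc's <execinfo.h>. dump_backtrace is a hypothetical helper name, not part of the project. In a binary linked with -rdynamic, the printed frames can usually be resolved to the names of non-static functions; without the flag, the same call typically prints raw addresses only.

```c
#include <assert.h>
#include <execinfo.h>
#include <unistd.h>

#define MAX_FRAMES 16

/* Hypothetical helper: print the current call stack to stderr and
 * return the number of frames captured. */
int dump_backtrace(void)
{
    void *frames[MAX_FRAMES];
    int n = backtrace(frames, MAX_FRAMES);

    assert(n >= 0 && n <= MAX_FRAMES);
    /* One line per frame. Function names in the executable can only be
     * shown if they are in the dynamic symbol table, which is what
     * linking with -rdynamic arranges. */
    backtrace_symbols_fd(frames, n, STDERR_FILENO);
    return n;
}
```

Calling dump_backtrace() from a few nested functions, once in a build with -rdynamic and once without, is a quick way to compare the two outputs. Tools that report call sites, such as mutrace, benefit from the same symbol information.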

list.[ch]

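intptr_t
Another data type I had not seen before. I looked up its declaration and traced its use back to storing the value read by scanf("%ld", &data);. From what I had read, long has different widths on different platforms, so I presume the type was chosen for cross-platform compatibility. The declaration shows that intptr_t is an integer type wide enough to hold a void * pointer; on this 64-bit platform it is long int.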

/* Types for `void *' pointers.  */
#if __WORDSIZE == 64
# ifndef __intptr_t_defined
typedef long int		intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned long int	uintptr_t;
#else
# ifndef __intptr_t_defined
typedef int			intptr_t;
#  define __intptr_t_defined
# endif
typedef unsigned int		uintptr_t;
#endif
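As a small illustration of why intptr_t is convenient here, the sketch below stores an integer in a pointer-sized slot and reads it back, and parses text with the SCNdPTR macro from <inttypes.h> instead of a hard-coded "%ld" (which only matches intptr_t on platforms where it is long). The function names are hypothetical. The C standard guarantees the pointer to intptr_t to pointer round trip; the integer to pointer to integer direction used by pack_value is implementation-defined, although it behaves as expected on common flat-memory platforms.

```c
#include <assert.h>
#include <inttypes.h>
#include <stdio.h>

/* Store an integer in a pointer-sized slot and read it back. */
void *pack_value(intptr_t value) { return (void *) value; }
intptr_t unpack_value(void *slot) { return (intptr_t) slot; }

/* Parse one integer from text with the conversion macro that matches
 * intptr_t on every platform. Returns 1 on success, 0 otherwise. */
int parse_intptr(const char *text, intptr_t *out)
{
    return sscanf(text, "%" SCNdPTR, out) == 1;
}
```

For printing, the matching macro is PRIdPTR, as in printf("[%" PRIdPTR "] ", value).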

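The list_print function asserts that the sorted data is in order. Note that the comparison requires each value to be smaller than its successor, i.e. ascending order, even though the assertion message says "decreasing":

assert((cur->data < cur->next->data) &&
       "sorted data should be in decreasing order");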

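=> This gave me an idea for an extension: let the user decide whether to sort in ascending or descending order.

This should not be left for the user to "decide"; it belongs within the scope of the automated test system. jserv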
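In the spirit of that comment, an automated test could call a checker that returns a result instead of aborting inside list_print. The following is only a sketch: check_node_t is a hypothetical stand-in for the project's node type (an intptr_t payload plus a next pointer), and ascending order is assumed because that is what the `<` comparison in the assertion tests.

```c
#include <assert.h>   /* for assert() in test code that calls the checker */
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the project's list node. */
typedef struct check_node {
    intptr_t data;
    struct check_node *next;
} check_node_t;

/* Return true if every value is strictly smaller than its successor.
 * An empty list and a single-element list count as sorted. */
bool list_is_ascending(const check_node_t *head)
{
    for (const check_node_t *cur = head; cur && cur->next; cur = cur->next) {
        if (!(cur->data < cur->next->data))
            return false;
    }
    return true;
}
```

A test driver can then feed generated input to the sort and use assert(list_is_ascending(head)) on the result, with no manual inspection of the printed list.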

threadpool.[ch]

task_t

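The first line of task_free() is "free(the_task->arg);", which means arg must point to memory obtained from malloc, typically a struct that wraps the arguments. It cannot be the address of a local or global variable, nor an integer cast to a pointer: my first attempt was to pass a positive integer with tmp_task->arg = (void*)1, and that does not work.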
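The ownership rule can be sketched as follows. demo_task_t and demo_task_free are hypothetical mirrors of the project's task type and task_free(), reduced to the two fields and the two free() calls discussed here; make_int_task and record_value are illustrative names. The point is that the argument is boxed on the heap, so the later free(arg) is valid. Passing (void *) 1 or the address of a stack variable would hand free() a pointer that malloc never returned, which is undefined behavior, while a NULL arg is harmless because free(NULL) does nothing.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical mirror of the project's task type. */
typedef struct {
    void (*func)(void *);
    void *arg;
} demo_task_t;

/* Mirrors the behavior described above: the task owns its argument. */
void demo_task_free(demo_task_t *task)
{
    assert(task != NULL);
    free(task->arg); /* valid only for malloc'ed memory or NULL */
    free(task);
}

/* Build a task whose argument is a heap-allocated copy of value. */
demo_task_t *make_int_task(void (*func)(void *), int value)
{
    demo_task_t *task = malloc(sizeof(*task));
    int *boxed = malloc(sizeof(*boxed));

    if (!task || !boxed) {
        free(task);
        free(boxed);
        return NULL;
    }
    *boxed = value;
    task->func = func;
    task->arg = boxed;
    return task;
}

/* Sample task body: remember the integer it was given. */
int last_seen;
void record_value(void *arg) { last_seen = *(int *) arg; }
```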

tqueue_t

Stores the tasks that the developer pushes into the queue with tqueue_push().

tpool_t

The manager of the worker threads.

Example (GitHub)

To understand how the thread pool is used, I wrote a small program that prints a string.

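#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <unistd.h>

#include "threadpool.h"

static tpool_t *gPool = NULL;

/* Sample task: print the kernel thread ID of the worker running it. */
void task_a(void *data)
{
    (void) data; /* unused */
    /* syscall() returns long, so print it with %ld rather than %d */
    printf("thread id: %ld, with func: %s\n",
           (long) syscall(SYS_gettid), __func__);
}

/* Worker loop: pop a task, run it, then release it. If tqueue_pop()
 * returns NULL the loop simply tries again. */
void *worker(void *data)
{
    (void) data; /* unused */
    task_t *_task;
    while (1) {
        _task = tqueue_pop(gPool->queue);
        if (_task) {
            _task->func(_task->arg);
            free(_task);
        }
    }
    return NULL;
}

const int thread_count = 3;

int main(void)
{
    gPool = (tpool_t *) malloc(sizeof(tpool_t));
    if (!gPool)
        return 1;
    tpool_init(gPool, thread_count, worker);

    for (int i = 0; i < 1000; i++) {
        usleep(100000); /* submit one task every 100 ms */
        printf("create new task\n");
        task_t *_task = (task_t *) malloc(sizeof(task_t));
        if (!_task)
            break;
        _task->func = task_a;
        _task->arg = NULL;
        tqueue_push(gPool->queue, _task);
    }

    tpool_free(gPool);
    return 0;
}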

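After finishing this simple example program, a few takeaways:
1. The thread pool could be encapsulated better so that other developers can use it more easily. In my example, worker() fetches and runs the tasks; users of the thread pool should not have to deal with this function, so it could be wrapped inside threadpool.c.

2. When I first read the code I could not tell how tqueue_t and tpool_t relate. My current understanding is that the queue is the container that stores tasks, while the pool manages the threads that do the work (think of them as workers). On that basis I would rename tqueue_t to ttask_queue_t and tpool_t to tworker_pool_t for readability.

3. Naming. In this project the leading l of llist_t and the leading t of tqueue_t/tpool_t both look like prefixes, but other objects in the same files do not follow that pattern; the lack of a consistent naming rule is a bit of a code smell XD. I was also unsure of the conventions for naming global variables in C or in the Linux kernel, so in the example I initially followed Hungarian notation and used a g prefix for the global variable.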

The Linux kernel coding style describes the principle for global variables as follows:

GLOBAL variables (to be used only if you _really_ need them) need to
have descriptive names, as do global functions.  If you have a function
that counts the number of active users, you should call that
"count_active_users()" or similar, you should _not_ call it "cntusr()".

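In short, unless you "really" need a global variable (my reading: avoid them in principle), give it a descriptive name and avoid abbreviations.
Good example: count_active_users()
Bad example: cntusr()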

Do not use Hungarian notation! Read the Linux kernel coding style carefully. jserv
Fixed, and I have added the passage on global variables from that document. kobeyu

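Performance analysis

a. 4 threads, 50 items (100 decreasing to 50), run 50 times:
[Figure: execution time of each of the 50 runs]

b. 4 threads, 100 items (100 decreasing to 0), run 50 times:
[Figure: execution time of each of the 50 runs]

These results show something unusual: with the same algorithm, the same input, and the same number of threads, the execution times should in theory be close to one another, yet they actually vary quite noticeably.
After thinking about it for a while, my hypothesis is that contention on the mutex (threads racing for the lock) makes the execution time vary. I wrote a small program (kythreads) to test this, using mutrace to see whether mutex activity correlates with execution time.
Experimental result:
The preliminary experiments show no obvious correlation; I am still working out which part is wrong...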
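To quantify how much the runs differ instead of judging by eye, the per-run times can be collected and summarized. The sketch below is not taken from kythreads; now_ms, run_stats_t and summarize_runs are hypothetical names. It reads a monotonic clock (so wall-clock adjustments do not distort the measurement) and reports the minimum, maximum and mean of a set of run times.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stddef.h>
#include <time.h>

typedef struct {
    double min_ms, max_ms, mean_ms;
} run_stats_t;

/* Monotonic timestamp in milliseconds. */
double now_ms(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1e3 + ts.tv_nsec / 1e6;
}

/* Summarize n run times; n must be at least 1. */
run_stats_t summarize_runs(const double *ms, size_t n)
{
    assert(n > 0);
    run_stats_t s = { ms[0], ms[0], 0.0 };
    double sum = 0.0;

    for (size_t i = 0; i < n; i++) {
        if (ms[i] < s.min_ms)
            s.min_ms = ms[i];
        if (ms[i] > s.max_ms)
            s.max_ms = ms[i];
        sum += ms[i];
    }
    s.mean_ms = sum / (double) n;
    return s;
}
```

Each run would be bracketed by two now_ms() calls and the difference stored in an array of 50 entries. With inputs as small as 50 to 100 items, thread start-up and scheduling noise may account for a large share of each measurement, so comparing the max/min spread against a larger input could help separate that effect from mutex contention.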

lock free

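One way to implement lock-free code is the CAS (compare-and-swap) operation; how it evolved and how it works are covered in cse378. In short, it compares the value at a memory address with an old value: if they are equal, it stores a new value at that address; if not, the caller has to handle the failure (typically by retrying). Underneath it relies on an atomic operation: because the CPU itself carries out the atomic operation, the load and the store complete as one indivisible step that cannot be interrupted.
The teacher's project concurrent-ll uses the atomic builtin provided by gcc, "__sync_val_compare_and_swap(a, b, c)". As before, I will first verify in my earlier small program (GitHub) that a lock-free version produces correct output with less variation in execution time, and then try to convert mergesort-concurrent into a lock-free version.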
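The retry pattern that CAS implies can be sketched with the gcc builtins mentioned above. This is an illustration only, not code from concurrent-ll: cas_increment, cas_push and stack_node_t are hypothetical names. Each function reads the current value, computes the desired new value, and retries if another thread changed the location in between.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node type for the stack example. */
typedef struct stack_node {
    int value;
    struct stack_node *next;
} stack_node_t;

/* Atomically add 1 to *counter with a CAS retry loop; returns the new
 * value. The swap succeeds only if *counter still equals old. */
int cas_increment(int *counter)
{
    int old;
    do {
        old = *counter;
    } while (__sync_val_compare_and_swap(counter, old, old + 1) != old);
    return old + 1;
}

/* Lock-free push: link the node in front of the current head, then
 * swing the head pointer if nobody else changed it in the meantime. */
void cas_push(stack_node_t **head, stack_node_t *node)
{
    stack_node_t *old_head;
    do {
        old_head = *head;
        node->next = old_head;
    } while (!__sync_bool_compare_and_swap(head, old_head, node));
}
```

Only push is shown on purpose. A matching lock-free pop is harder to get right, because a node can be removed and reinserted between the read and the swap (the ABA problem), so it is left out of this sketch.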

The results were not what I expected; still thinking...

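An article on Mainz's blog says that lock-free code built on CAS is not free of cost: according to the article, most of the time goes into locating the data to be accessed in the L1 cache, but that is still far cheaper than the context switch a mutex can require.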

Bugs

1. Non-numeric input makes the program produce wrong output.

$ ./sort 6 4
input unsorted data line-by-line
a

sorted results:
[140720737361392] [140720737361392] [140720737361392] [140720737361392]

=> A feature that could be added: sorting both numbers and letters.

The code contains a FIXME reminding you to avoid scanf; you should design an automated testing mechanism, as concurrent-ll does. jserv
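The garbage values in the output are consistent with how scanf behaves: when "%ld" meets the letter a, the conversion fails, scanf returns 0, the variable is left unchanged, and the offending character stays in the input, so every later call fails the same way. One possible replacement, sketched here with a hypothetical helper name, is to read each line as text (for example with fgets) and validate it with strtol before it reaches the list.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: parse one input line as a long.
 * Returns true and stores the value only if the whole line is a
 * number, optionally followed by whitespace or a newline. */
bool parse_long_line(const char *line, long *out)
{
    char *end;

    errno = 0;
    long value = strtol(line, &end, 10);
    if (end == line || errno == ERANGE)
        return false; /* no digits at all, or out of range */

    while (*end == ' ' || *end == '\t' || *end == '\n' || *end == '\r')
        end++;
    if (*end != '\0')
        return false; /* trailing junk such as "12abc" */

    *out = value;
    return true;
}
```

The caller can then reject or skip a bad line and report it, instead of sorting uninitialized data. The same function is easy to drive from an automated test, since it takes a string rather than reading from stdin.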

synchronization object

manager-worker architecture

mutex contention - mutrace

Improved version

lock-free

Implementing the threads with C11's own <threads.h>

References

-rdynamic
- gcc documentation
- Logan's Blog
intptr_t
- github mirror
- Anker's blog
64-bit data type: long
- Wikipedia
- dada's blog
concurrent
- Mainz's blog
- cse378

"wiki" and "Wikipedia" are not the same thing; do not mix them up, and the latter must start with a capital letter. jserv

Fixed. kobeyu