contributed by < tina0405
>, < heathcliffYang
>
typedef intptr_t val_t;
typedef struct node {
val_t data; /**< Data of the node */
struct node *next; /**< Pointer to the next node */
} node_t;
其中在 stdint.h
裡面的定義
/* Types for `void *' pointers. */
#if __WORDSIZE == 64
# ifndef __intptr_t_defined
typedef long int intptr_t;
# define __intptr_t_defined
# endif
typedef unsigned long int uintptr_t;
#else
# ifndef __intptr_t_defined
typedef int intptr_t;
# define __intptr_t_defined
# endif
typedef unsigned int uintptr_t;
#endif
static uint32_t build_list_from_file(llist_t *_list, const char *filename)
{
FILE *fp = fopen(filename, "r");
char buffer[16];
while (fgets(buffer, 16, fp) != NULL) {
char *name = (char*)malloc(16);
strcpy(name, buffer);
list_add(_list, (val_t)name);
}
fclose(fp);
return _list->size;
}
char * -> val_t ??
pointer & string
#include <stdlib.h>
void *malloc(size_t size);
malloc 說明提到 : The malloc() function allocates size bytes and returns a pointer to the allocated memory. The memory is not initialized. 剛好 char 為 1 byte,在 phonebook 作業中,最大的 size 為 16
void list_print(const llist_t * const list)
{
const node_t *cur = list->head;
while (cur) {
xprint((char*)cur->data); // adjust to character and disable new line
cur = cur->next;
}
}
llist_t *small;
int tmp = strcmp((char*)(a->head->data), (char*)(b->head->data));
small = (llist_t*) ((intptr_t) a * (tmp <= 0) + (intptr_t) b * (tmp > 0));
強制轉換
參考了 0xff07 的作業 & github,可以用很簡單的方法、改的部分很少就達到目的真的很厲害!!
# String version test
genVerify:
sort test_data/input.txt > test_data/result_from_command.txt
check_string:
./sort $(THREADS) test_data/input.txt > test_data/result.txt
diff test_data/result.txt test_data/result_from_command.txt
clean_data:
rm -f test_data/input.txt test_data/result.txt test_data/result_from_command.txt
$ (for i in {1..8}; do echo $RANDOM; done) > test_data/test.txt
計畫將設計一個小程式,可以運用機率與統計所學以及更多延伸的資料,來測試拿到的資料的特性,例如是否是均勻分布的亂數排列,夠不夠「亂」? 針對這個問題,我想到了 Randomness test。
找到了一個網頁 Test for random number generators 提供檢測 RNGs,還有一個 lecture 是 Raj Jain 的 Testing Random Testing RandomNumber Generators,但正確性有待查證。
Standard deviation
先從比較簡單的著手,先分析它的 standard deviation (variance 的平方根)
但是在此次作業的案例來說,已知範圍內每個數字都只出現一次,不適用這個檢測方法
Spectral test
要先了解 LCGs 是什麼
//input.txt
35
66
78
45
55
96
85
22
10
//terminal
tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt
#Total_tasks_consumed: 12
#Elapsed_time: 0.405 ms
#Throughput: 29629 (per sec)
10
22
35
45
55
66
78
85
96
這裡的例子是->
運行命令為:
hello.exe hello world
那麼, argc 的值是 3, argv[0] 是 "hello.exe" ,argv[1]是 "hello" , argv[2] 是 "world" 。
#define USAGE "usage: ./sort [thread_count] [input_file]\n"
#define MIN(a, b) (((a) < (b)) ? (a) : (b))
...
int main(int argc, char const *argv[])
{
if (argc < 3) {
printf(USAGE);
return -1;
}
thread_count = atoi(argv[1]);
/* Read data */
the_list = list_new();
data_count = build_list_from_file(the_list, argv[2]);
max_cut = MIN( thread_count, data_count);
...
參考pthread.h解說。
thread 的 mutex 型態是 pthread_mutex_t.
在使用前, 記得要初始化:
對於靜態分配可設置為PTHREAD_MUTEX_INITIALIZER,或呼 pthread_mutex_init,回傳 0 表示此 mutex 被使用(成功).
對於動態分配的互斥量, 在申請內存 (malloc) 之後, 通過 pthread_mutex_init 進行初始化, 並且在釋放內存 (free) 前需要呼叫 pthread_mutex_destroy.
int pthread_mutex_init(pthread_mutex_t *mutex, const pthread_mutexattr_t *attr);
int pthread_mutex_destroy(pthread_mutex_t *mutex);
pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_mutex_init(&(data_context.mutex), NULL);
tmp_list = NULL;
pool = (tpool_t *) malloc(sizeof(tpool_t));
tpool_init(pool, thread_count, task_run);
malloc 好一個空間後開始 tpool_init
其中 task_run :
static void *task_run(void *data __attribute__ ((__unused__)))
{
while (1) {
task_t *_task = tqueue_pop(pool->queue);
if (_task) {
if (!_task->func) {
tqueue_push(pool->queue, _task);
break;
} else {
_task->func(_task->arg);
free(_task);
}
}
}
pthread_exit(NULL);
}
先說一下 list, merge_sort, threadpool 等等這些檔案應該要有個別其他資料夾來放置,在找檔案時會比較好看、好找。
不知道是一開始設定的問題,還是程式是用 recursive 的方法做的, call graph 的情況跟我預想的不太一樣,這需要去找 gprof 的運作方式來查證。
此圖為 4 個 thread
這可以看出 tqueue_pop 被叫用的次數非常多,一直在 task_run 的詢問,效率就會很差。
這部分是 merge sort 真正實作的方法,少少的三個函式,其中 merge_sort 是最上層的包裝,由它叫用 merge sort 的第一份工作 split_n_merge : 先用 recursive 的方法去分割 list 直到 只剩 1 個 node ;再執行第二份工作 sort_n_merge : 從最小的 list 一路 sort 上來。
在 task 的分配當中,主要是依照 thread 的數目分 task 並 push 到 queue 裡面,以及最後 merge 各 thread 之間的 list 也會是 queue 裡的 task ,但其實這樣的分配下, queue 裡的 task 不是很多,沒有完全發揮到 task_run 機制。
Lock-free linkedlist implementation of Harris' algorithm
"A Pragmatic Implementation of Non-Blocking Linked Lists"
T. Harris, p. 300-314, DISC 2001.
原本的數字串如下:
如果今天想要 insert Node_20 , 我們會這樣做:
但如果今天 insert Node_20 和 delete Node_10 同時發生,就有可能出現以下情況:
但我們不樂見,因為我們希望他會走訪 Node_20
文中提到的解決方法就是:
We say that a node is logically deleted after the first stage
and that it is physically deleted after the second.
typedef struct {
node_t *head; /**< The head of the linked list */
uint32_t size; /**< The size of the linked list */
node_t *tail;
} llist_t;
the_list->head = new_node(INT_MIN, NULL);
the_list->tail = new_node(INT_MAX, NULL);
the_list->head->next = the_list->tail;
the_list->size = 0;
new_node(val_t val, node_t *next);
NULL should be INT_MIN & INT_MAX which are built in constants.
Otherwise, data type will be not consistent
感謝原來如此,已改掉。 Tina
int list_add(llist_t *the_list, val_t val)
{
node_t *right, *left;
right = left = NULL;
node_t *new_elem = new_node(val, NULL);
while (1) {
right = list_search(the_list, val, &left);
if (right != the_list->tail && right->data == val) {
return 0;
}
new_elem->next = right;
if (CAS_PTR(&(left->next), right, new_elem) == right) {
FAI_U32(&(the_list->size));
return 1;
}
}
}
#define CAS_PTR(a, b, c) __sync_val_compare_and_swap(a, b, c)
#define FAI_U32(a) __sync_fetch_and_add(a, 1)
順便看了一篇文章:
Multithreaded simple data type access and atomic variables
文章裡剛好介紹了一些我不太懂的事:
就是我看了一些人寫的程式,他們都說用了gcc內建的 atomic function,就可以使用出無鎖(lock-free)的程式,而 concurrent-ll 中 lock-free 也的確有下列的使用:
// Compare-and-swap
#define CAS_PTR(a, b, c) __sync_val_compare_and_swap(a, b, c)
#define CAS_U8(a, b, c) __sync_val_compare_and_swap(a, b, c)
#define CAS_U16(a, b, c) __sync_val_compare_and_swap(a, b, c)
#define CAS_U32(a, b, c) __sync_val_compare_and_swap(a, b, c)
#define CAS_U64(a, b, c) __sync_val_compare_and_swap(a, b, c)
// Fetch-and-increment
#define FAI_U8(a) __sync_fetch_and_add(a, 1)
#define FAI_U16(a) __sync_fetch_and_add(a, 1)
#define FAI_U32(a) __sync_fetch_and_add(a, 1)
#define FAI_U64(a) __sync_fetch_and_add(a, 1)
// Fetch-and-decrement
#define FAD_U8(a) __sync_fetch_and_sub(a, 1)
#define FAD_U16(a) __sync_fetch_and_sub(a, 1)
#define FAD_U32(a) __sync_fetch_and_sub(a, 1)
#define FAD_U64(a) __sync_fetch_and_sub(a, 1)
// Increment-and-fetch
#define IAF_U8(a) __sync_add_and_fetch(a, 1)
#define IAF_U16(a) __sync_add_and_fetch(a, 1)
#define IAF_U32(a) __sync_add_and_fetch(a, 1)
#define IAF_U64(a) __sync_add_and_fetch(a, 1)
// Decrement-and-fetch
#define DAF_U8(a) __sync_sub_and_fetch(a, 1)
#define DAF_U16(a) __sync_sub_and_fetch(a, 1)
#define DAF_U32(a) __sync_sub_and_fetch(a, 1)
#define DAF_U64(a) __sync_sub_and_fetch(a, 1)
FSB stands for Front Serial Bus. This is the bus that processor use
to communicate with RAM. I.e. locking FSB will prevent from any
other processor (core), and process running on that processor, from
accessing RAM. And this is exactly what we need to implement atomic v
ariables.
我知道這些後,我想下一步就是把我 main 裏面的執行換成 atomic function,然後就可以改掉那些 mutex 的資源保護,而不是只有在 list.h 裡改那些函式 ,因為我們不只有加減法還有邏輯運算元和轉換可以使用。
附上這次實驗結果,還要繼續改 main.c 檔
跟測原檔的 input.txt 是同一個
//lock-free
tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt
#Total_tasks_consumed: 12
#Elapsed_time: 0.302 ms
#Throughput: 39735 (per sec)
10
12
22
35
45
55
66
78
85
96
//original
tina@tina-X550VB:~/mergesort-concurrent$ ./sort 4 input.txt
#Total_tasks_consumed: 12
#Elapsed_time: 0.405 ms
#Throughput: 29629 (per sec)
10
22
35
45
55
66
78
85
96
改進的方向有兩個:
自從加了 lock-free 裡的函式後發現 make check 有問題,但上面跑的結果是對的,因此就想到是不是數據不夠多。
馬上來利用 make check 裡有附的亂數產生工具
tina@tina-X550VB:~/mergesort-concurrent$ bash scripts/gen-random-numbers.sh 1024 in.txt
tina@tina-X550VB:~/mergesort-concurrent$sort -g in.txt > out.txt
tina@tina-X550VB:~/mergesort-concurrent$./sort 4 in.txt | tail -n +4 > result.txt
tina@tina-X550VB:~/mergesort-concurrent$bash scripts/compare.sh out.txt result.txt
EXAMPLE:
//out.txt
1
2
2
3
//result.txt
1
2
3
if (right != the_list->tail && right->data == val) {
return 0;
}
來測試之前給的資料 mutrace
在終端機 git clone git://git.0pointer.net/mutrace.git
README檔上說:
mutrace profiles lock contention for you. Just use it as prefix for your usual
command line and it will profile mutexes used in all child processes.
Example:
mutrace gedit
所以我就將 mutrace 放置偵測檔案的前頭,不知道有沒有理解錯誤。
使用方法: 將 mutrace 放置 command line 前方
tina@tina-X550VB:~/mergesort-concurrent$ mutrace ./sort 4 in.txt
mutrace: Showing statistics for process sort (pid 9173).
mutrace: 3 mutexes used.
Mutex #2 (0x0x1eb4670) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fbea009b4b2]
./sort(tqueue_init+0x46) [0x40150b]
./sort(tpool_init+0x6a) [0x40176f]
./sort(main+0xe5) [0x401f10]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fbe9fad2830]
Mutex #0 (0x0x7fbe9d6a1860) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_lock+0x49) [0x7fbea009b6b9]
/lib/x86_64-linux-gnu/libgcc_s.so.1(_Unwind_Find_FDE+0x2c) [0x7fbe9d49dfec]
[(nil)]
Mutex #1 (0x0x603140) first referenced by:
/usr/lib/mutrace/libmutrace.so(pthread_mutex_init+0xf2) [0x7fbea009b4b2]
./sort(main+0xab) [0x401ed6]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf0) [0x7fbe9fad2830]
mutrace: Showing 3 most contended mutexes:
Mutex # Locked Changed Cont. tot.Time[ms] avg.Time[ms] max.Time[ms] Flags
2 143 29 22 0.039 0.000 0.001 Mx.--.
0 20 7 2 0.007 0.000 0.001 M-.--.
1 6 4 0 0.002 0.000 0.000 M-.--.
||||||
/|||||
Object: M = Mutex, W = RWLock /||||
State: x = dead, ! = inconsistent /|||
Use: R = used in realtime thread /||
Mutex Type: r = RECURSIVE, e = ERRRORCHECK, a = ADAPTIVE /|
Mutex Protocol: i = INHERIT, p = PROTECT /
RWLock Kind: r = PREFER_READER, w = PREFER_WRITER, W = PREFER_WRITER_NONREC
mutrace: Note that the flags column R is only valid in --track-rt mode!
mutrace: Total runtime is 5.421 ms.
mutrace: Results for SMP with 4 processors.
is only valid in --track-rt mode!
以下與 low-order mark bit (logically deleted (1) or not (0))相關的部分,包含
判斷是否 marked
Reference 中的文件提到關於 is_marked_ref : returns true if and only if r is a marked reference.
int is_marked_ref(long i)
{
return (int)(i & 0x1L);
}
前面部份必會被 mask 為 0 ,只剩下 mark bit ,值則由 i 的 LSB 決定
mark 與 unmark
i &= ~0x1L;
AND (111…0), 代表前面的部分會維持 i 值原來的樣子, LSB 必設定為 0
i |= 0x1L;
OR (000…1), 代表前面部分會維持 i 值原來的樣子, LSB 必設定為 1
取得目標 marked 或 unmarked 的狀態
0x1L ,這個數字的意思是型態為 long integer (4 bytes) 的 hexadecimal 的數字 1
list_search 的部分, *left_node 和 left_node_next 只會記錄 unmarked 的 t_next,也就是說, *left_node 的 next (也就是 left_node_next) 永遠不會是指向被 mark 起來的 node
而 search 的結果只會回傳確定 right_node_next 不是 marked 的 right_node
CAS_PTR(&((*left_node)->next), left_node_next, right_node);
list_add
FAI_U32(&(the_list->size));
這個的意思是 size 加 1
list_remove : logically delete 但在 code 中看不出來哪一個部分有寫到 set_mark
list_delete : empty -> not for now
include/atomic_ops_if.h
_sync*() 在 gcc 的線上文件 中有提到是 atomic memory access 的內建函式。但要注意的是並不是每個處理器都支援這些 operations 。
這些函式大部分都是 full barrier ,也就是說,no memory operand will be moved across the operation, either forward or backward,Further, instructions will be issued as necessary to prevent the processor from speculating loads across the operation and from queuing stores after the operation.
FAI_U32(a)
則為 __sync_fetch_and_add(a, 1)
{ tmp = *ptr; *ptr op= value; return tmp; }
{ tmp = *ptr; *ptr = ~tmp & value; return tmp; } // nand
CAS_PTR(a,b,c)
則為 __sync_val_compare_and_swap (type *ptr, type oldval type newval, ...)
src/lockfree/main.c
主要是 main() 裡面先初始化各個需要用到的 object 及變數,再來解讀 command-line 並設定各項參數,接下來是真正要開始執行以及計算 operations
// counts the number of operations each thread performs
unsigned long num_operations;
include/utils.h
在這個網頁的解釋 :
This attribute specifies a minimum alignment (in bytes) for variables of the specified type.
不過需要特別注意的是,aligned attribute 的大小會被 linker 限制 (maximum)
ALIGNED(N) __attribute__((aligned(N)))
test-lock (src/lock/main.c)
Thread 0
#operations : 253365
#inserts : 12882
#removes : 12881
Duration : 1000 (ms)
#txs : 253365 (253365.000000 / s)
Expected size: 1024 Actual size: 1024
從 Makefile 來看,make bench
可以叫用 benchmarking scripts
bench: $(EXEC)
bash scripts/run_ll.sh
bash scripts/create_plots_ll.sh >/dev/null
@echo Check the plots generated in directory 'out/plots'.
從 README.md 裡面看到 Tools 的相關描述,發現有許多可以測試實作的 script,下面是一些閱讀理解。
run_ll.sh
initials="128 1024 8192";
updates="0 10 50"
out="$out_dir/ll.i$initial.u$update.dat"
./scripts/scalability2.sh "$cores" \
./out/test-lock ./out/test-lockfree \
-d$duration -i$initial -r$range -u$update | tee $out;
scalability2.sh
原本的範例用法為 scalability2.sh all ./out/test-lock ./out/test-lockfree -i100
原本 make bench
的輸出
*** Using as a default number of cores: 4 on 1 socket
***
bash scripts/run_ll.sh
* -i128
** -u0
## Use 4 as the number of cores.
## If not correct, change it in scripts/config
# ./out/test-lock ./out/test-lockfree
#cores throughput %linear scalability throughput %linear scalability
1 2035637 100.00 1 6648282 100.00 1
2 1255928 30.85 0.62 13058967 98.21 1.96
3 1560506 25.55 0.77 15337146 76.90 2.31
4 1774850 21.80 0.87 16416128 61.73 2.47
thr1a=$($run_script ./$prog $params -n1 | grep "#txs" | cut -d'(' -f2 | cut -d. -f1);
operations * 1000.0 / duration
數字,並且是只取整數部分;其中, cut 是擷取輸出的指令, -d 的意思是以 " ( " 為分隔, -f2 是取分隔後的第二個部分$thr/$thr1
100*(1-(($c-$scl)/$c))
test_correctness.sh
先執行 scripts 裡面的 lock_exec & config
再執行 ./out/test-lock & ./out/lockfree
看不太懂 lock_exec 裡面的 /tmp/lock_exec_$(whoami) 是指什麼?
scalability1.sh :
原本的範例用法為 scripts/scalability1.sh all ./out/test-lock -i128
create_plots_ll.sh
trap : 一個 shell 內建的 function,可以定義 handlers 以回應收到的硬體訊號或其他事件。
trap [-lp] [[ARG] SIGNAL_SPEC...]
common signals : 這個是 Sending and Trapping Signals的介紹
Name | Number | Meaning |
---|---|---|
HUP | 1 | Hang Up. The controlling terminal has gone away. |
INT | 2 | Interrupt. The user has pressed the interrupt key (usually Ctrl-C or DEL). |
QUIT | 3 | Quit. The user has pressed the quit key (usually Ctrl-). Exit and dump core. |
KILL | 9 | Kill. This signal cannot be caught or ignored. Unconditionally fatal. No cleanup possible. |
TERM | 15 | Terminate. This is the default signal sent by the kill command. |
EXIT | 0 | Not really a signal. In a shell script, an EXIT trap is run on any exit, signalled or not. |
shell scripts :
bc : An arbitrary precision calculator language
c 裡面 constant 的解釋
It may have a prefix that specifies its base and a suffix that specifies its type.
Call graph command
sudo apt install graphviz
gprof ./sort | gprof2dot | dot -T png -o out/plot/gprof1.png
Github: LockLess
這篇文章主要是在改善時代回收演算法的效能,時代演算法是垃圾回收演算法的一種,在實作 lock free